Agregación y mergeo de dataframes#

En esta sección vamos a introducir dis tipo de operaciones muy comunes cuando trabajamos con dataframes, inspiradas en el lenguaje sql: agregaciones con groupby y mergeos con merge. La primera nos permitirá aplicar funciones de agregación a subconjuntos de nuestro dataframe y la segunda hacer que varios dataframes interactúen cuando tienen una o varias claves en común

Agregaciones#

Cargamos un dataset sobre una encuesta realizada a programadores de python

import pandas as pd
import numpy as np

url = "https://github.com/mattharrison/datasets/raw/master/data/2020-jetbrains-python-survey.csv"
jb = pd.read_csv(url)
/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py:3326: DtypeWarning: Columns (152) have mixed types.Specify dtype option on import or set low_memory=False.
  exec(code_obj, self.user_global_ns, self.user_ns)

Seleccionamos algunas de las columnas y realizamos un pipeline de transformaciones

cols = [
    'age',
    'are.you.datascientist',
    'company.size',
    'country.live',
    'employment.status',
    'first.learn.about.main.ide',
    'how.often.use.main.ide',
    'ide.main',
    'is.python.main',
    'job.team',
    'main.purposes',
    'missing.features.main.ide',
    'nps.main.ide',
    'python.years',
    'python2.version.most',
    'python3.version.most',
    'several.projects',
    'team.size',
    'use.python.most',
    'years.of.coding'
]
jb2 = (jb
    [cols]
    .rename(columns=lambda c: c.replace('.', '_'))
    .assign(
        age=lambda df_: df_.age.str.slice(0,2).astype(float).astype('Int64'),
        are_you_datascientist=lambda df_: df_.are_you_datascientist.replace({'Yes': True, 'No': False, np.nan: False}), 
        company_size=lambda df_: df_.company_size.replace({
            'Just me': 1, 
            'Not sure': np.nan,
            'More than 5,000': 5000,
            '2–10': 2, 
            '11–50':11,
            '51–500': 51, 
            '501–1,000':501,
            '1,001–5,000':1001
        }).astype('Int64'), 
        country_live=lambda df_: df_.country_live.astype('category'), 
        employment_status=lambda df_: df_.employment_status.fillna('Other').astype('category'), 
        is_python_main=lambda df_: df_.is_python_main.astype('category'),
        team_size=lambda df_: df_.team_size.str.split(r'-', n=1, expand=True).iloc[:,0].replace('More than 40 people', 41).where(df_.company_size!=1, 1).astype(float),
        years_of_coding=lambda df_: df_.years_of_coding.replace('Less than 1 year', .5).str.extract(r'(\d+)').astype(float),
        python_years=lambda df_: df_.python_years.replace('Less than 1 year', .5).str.extract(r'(\d+)').astype(float),
        python3_ver=lambda df_: df_.python3_version_most.str.replace('_', '.').str.extract(r'(\d\.\d)').astype(float),
        use_python_most=lambda df_: df_.use_python_most.fillna('Unknown')
    ).drop(columns=['python2_version_most'])
)

En esta serie de transformaciones estamos usando el método assign, que nos permite mediante una serie de argumentos nombrados actualizar/crear columnas de un dataframe. Si se emplean funciones para describir las nuevas columnas, assign utiliza el estado del dataframe en el que se evalúa cada una, de modo que podemos usar columnas que hayan sido actualizadas en argumentos anteriores en lugar de usar varias veces assign.

jb2
age are_you_datascientist company_size country_live employment_status first_learn_about_main_ide how_often_use_main_ide ide_main is_python_main job_team main_purposes missing_features_main_ide nps_main_ide python_years python3_version_most several_projects team_size use_python_most years_of_coding python3_ver
0 30 False 1 NaN Partially employed by a company / organization Conference / User Group Weekly PyCharm Community Edition Yes Work as an external consultant or trainer For work No, it has all the features I need 3.0 3.0 Python 3_7 Yes, I work on many different projects 1.0 Unknown 1.0 3.7
1 21 True 5000 India Fully employed by a company / organization School / University Daily VS Code Yes Work in a team Both for work and personal No, it has all the features I need 8.0 3.0 Python 3_6 Yes, I work on one main and several side projects 2.0 Software prototyping 3.0 3.6
2 30 False 5000 United States Fully employed by a company / organization Friend / Colleague Daily Vim Yes Work on your own project(s) independently Both for work and personal No, it has all the features I need 10.0 3.0 Python 3_6 Yes, I work on one main and several side projects NaN DevOps / System administration / Writing autom... 3.0 3.6
3 <NA> False <NA> NaN Other Friend / Colleague Daily PyCharm Professional Edition Yes NaN Both for work and personal Yes – Please list: 10.0 11.0 Python 3_8 Yes, I work on many different projects 1.0 Web development 11.0 3.8
4 21 False <NA> Italy Student Search engines Daily VS Code Yes Work on your own project(s) independently For personal, educational or side projects No, it has all the features I need 10.0 1.0 Python 3_8 Yes, I work on one main and several side projects 1.0 Web development NaN 3.8
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54457 21 False 2 Russian Federation Fully employed by a company / organization School / University Daily Vim Yes Work on your own project(s) independently Both for work and personal No, it has all the features I need 10.0 6.0 Python 3_6 Yes, I work on many different projects NaN Data analysis 1.0 3.6
54458 <NA> False <NA> NaN Other NaN NaN NaN Yes NaN Both for work and personal NaN NaN 3.0 Python 3_7 NaN 1.0 Web development 1.0 3.7
54459 21 False 1 Russian Federation Self-employed (a person earning income directl... Friend / Colleague Daily PyCharm Professional Edition Yes Work in a team Both for work and personal No, it has all the features I need 10.0 3.0 Python 3_7 Yes, I work on many different projects 1.0 Web development 6.0 3.7
54460 30 True 51 Spain Fully employed by a company / organization Search engines Daily Other Yes Work on your own project(s) independently Both for work and personal Yes – Please list: 3.0 6.0 Python 3_7 Yes, I work on many different projects NaN Data analysis 3.0 3.7
54461 21 False 11 Algeria Fully employed by a company / organization Online learning platform / Online course Daily VS Code Yes Work in a team Both for work and personal No, it has all the features I need 10.0 1.0 Python 3_8 Yes, I work on many different projects 2.0 Unknown 1.0 3.8

54462 rows × 20 columns

Una de las formas que existen para realizar agregaciones en pandas es mediante el objeto DataFrameGroupBy, seguido siempre de una selección de columna(s) y una o varias funciones de agregación.

gb = jb2.groupby(["country_live"])
type(gb)
pandas.core.groupby.generic.DataFrameGroupBy
gb["age"]
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fab4deee2b0>

Podemos aplicar los métodos de agregación que hemos visto para numpy y pandas a los objetos de tipo _GroupBy

gb["age"].mean()
country_live
Algeria          27.528736
Argentina         29.65641
Armenia          25.967742
Australia        32.527578
Austria          29.914894
                   ...    
United States    32.023162
Uruguay           29.30303
Uzbekistan       23.290323
Venezuela        27.909091
Viet Nam         24.421053
Name: age, Length: 76, dtype: Float64

Podemos agregar usando varias columnas

s = (
    jb2
    .groupby(["country_live", "employment_status"])
    .age
    .mean()
)
s
country_live  employment_status                                                                             
Algeria       Freelancer (a person pursuing a profession without a long-term commitment to any one employer)    33.285714
              Fully employed by a company / organization                                                        28.344828
              Other                                                                                             43.333333
              Partially employed by a company / organization                                                    25.181818
              Retired                                                                                                60.0
                                                                                                                  ...    
Viet Nam      Partially employed by a company / organization                                                         23.4
              Retired                                                                                                <NA>
              Self-employed (a person earning income directly from one's own business, trade, or profession)         35.5
              Student                                                                                           19.384615
              Working student                                                                                   24.571429
Name: age, Length: 608, dtype: Float64

Para convertir el último nivel del índice en una columnas usamos el método unstack

s.unstack()
employment_status Freelancer (a person pursuing a profession without a long-term commitment to any one employer) Fully employed by a company / organization Other Partially employed by a company / organization Retired Self-employed (a person earning income directly from one's own business, trade, or profession) Student Working student
country_live
Algeria 33.285714 28.344828 43.333333 25.181818 60.0 28.411765 20.4375 21.0
Argentina 34.954545 30.209524 40.0 30.25 55.0 29.571429 22.913043 23.2
Armenia 30.0 23.842105 60.0 24.0 40.0 24.0 <NA> <NA>
Australia 30.818182 33.074468 33.1 32.0 48.285714 39.21875 24.828571 22.470588
Austria 35.857143 31.318182 30.0 29.0 60.0 34.615385 20.842105 23.8125
... ... ... ... ... ... ... ... ...
United States 34.318471 32.343223 35.534884 25.894737 54.920635 38.446512 21.947791 22.89375
Uruguay 30.0 28.909091 <NA> 36.2 <NA> 30.0 20.0 <NA>
Uzbekistan 36.0 22.125 29.0 21.0 <NA> 30.5 19.0 21.0
Venezuela 29.888889 27.826087 30.5 26.8 55.0 28.75 20.454545 28.833333
Viet Nam 28.8 23.025641 60.0 23.4 <NA> 35.5 19.384615 24.571429

76 rows × 8 columns

También podemos usar una función de agregación definida por nosotros usando el método agg

def per_emacs(s):
    return s.str.contains("Emacs").sum() / len(s) * 100
(
    jb2
    .groupby("country_live")
    .ide_main
    .agg(per_emacs)
)
country_live
Algeria          0.000000
Argentina        4.102564
Armenia          0.000000
Australia        2.877698
Austria          1.595745
                   ...   
United States    4.025157
Uruguay          0.000000
Uzbekistan       0.000000
Venezuela        0.000000
Viet Nam         0.000000
Name: ide_main, Length: 76, dtype: float64

Podemos realizar agregaciones múltiples

(
    jb2
    .groupby("country_live")
    .age
    .agg([min, max])
)
min max
country_live
Algeria 18 60
Argentina 18 60
Armenia 18 60
Australia 18 60
Austria 18 60
... ... ...
United States 18 60
Uruguay 18 60
Uzbekistan 18 60
Venezuela 18 60
Viet Nam 18 60

76 rows × 2 columns

Finalmente, también se puede hacer múltiples agregaciones para cada columna

(
    jb2
    .groupby("country_live")
    .agg({
        "age": ["min", "max"], 
        "team_size": "mean"
    })
)
age team_size
min max mean
country_live
Algeria 18 60 2.428571
Argentina 18 60 3.192053
Armenia 18 60 6.076923
Australia 18 60 2.710884
Austria 18 60 2.448000
... ... ... ...
United States 18 60 3.391337
Uruguay 18 60 4.692308
Uzbekistan 18 60 1.160000
Venezuela 18 60 1.812500
Viet Nam 18 60 3.285714

76 rows × 3 columns

combo final:

(
    jb2
    .groupby(["country_live", "employment_status"])
    .agg({
        "age": ["min", "max"], 
        "team_size": "mean"
    })
)
age team_size
min max mean
country_live employment_status
Algeria Freelancer (a person pursuing a profession without a long-term commitment to any one employer) 21 60 1.000000
Fully employed by a company / organization 18 60 2.750000
Other 30 50 1.000000
Partially employed by a company / organization 18 40 5.833333
Retired 60 60 1.000000
... ... ... ... ...
Viet Nam Partially employed by a company / organization 18 30 5.666667
Retired <NA> <NA> NaN
Self-employed (a person earning income directly from one's own business, trade, or profession) 21 60 1.000000
Student 18 21 1.000000
Working student 18 40 1.333333

608 rows × 3 columns

Mergeo#

Hasta ahora hemos tratado los dataframes de forma independiente, pero vamos a ver a continuación como juntas varios dataframes de diferentes modos

df1 = pd.DataFrame(
    {
        'name': ['John', 'George', 'Ringo'],
        'color': ['Blue', 'Blue', 'Purple']
    }
) 
df2 = pd.DataFrame(
    {
        'name': ['Paul', 'George', 'Ringo'],
        'carcolor': ['Red', 'Blue', np.nan]
    }, 
    index=[3, 1, 2]
)
df1
name color
0 John Blue
1 George Blue
2 Ringo Purple
df2
name carcolor
3 Paul Red
1 George Blue
2 Ringo NaN

El método merge realiza un inner join por defecto

df1.merge(df2)
name color carcolor
0 George Blue Blue
1 Ringo Purple NaN

Pero existen otros tipos de merge utilizados usualmente al gestionar en bases de datos

df1.merge(df2, how="outer")
name color carcolor
0 John Blue NaN
1 George Blue Blue
2 Ringo Purple NaN
3 Paul NaN Red
df1.merge(df2, how="left")
name color carcolor
0 John Blue NaN
1 George Blue Blue
2 Ringo Purple NaN

Por defecto, pandas cruza por aquellas columans que tengan el mismo nombre. Sin embargo a veces tenemos que pasarle el nombre de las columas, para ello tenemos los atributos on, left_on y right_on

df1.merge(df2, how="right", left_on="color", right_on="carcolor")
name_x color name_y carcolor
0 NaN NaN Paul Red
1 John Blue George Blue
2 George Blue George Blue
3 NaN NaN Ringo NaN

Pandas incorpora también el método join, que es similar a merge pero opera con los ínidices de los dataframes, no las columnas.