Aggregation and merging of dataframes#
In this section we introduce two types of operations that are very common when working with dataframes and that are inspired by SQL: aggregations with groupby
and merges with merge
. The former lets us apply aggregation functions to subsets of our dataframe; the latter lets several dataframes interact when they share one or more keys.
Aggregations#
We load a dataset from a survey of Python programmers:
import pandas as pd
import numpy as np
url = "https://github.com/mattharrison/datasets/raw/master/data/2020-jetbrains-python-survey.csv"
jb = pd.read_csv(url)
/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py:3326: DtypeWarning: Columns (152) have mixed types.Specify dtype option on import or set low_memory=False.
exec(code_obj, self.user_global_ns, self.user_ns)
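The warning comes from read_csv inferring column types chunk by chunk; as the message itself suggests, it can be silenced by passing low_memory=False (or an explicit dtype). A minimal sketch:
jb = pd.read_csv(url, low_memory=False)  # read in a single pass; avoids the DtypeWarning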
We select some of the columns and apply a pipeline of transformations:
cols = [
'age',
'are.you.datascientist',
'company.size',
'country.live',
'employment.status',
'first.learn.about.main.ide',
'how.often.use.main.ide',
'ide.main',
'is.python.main',
'job.team',
'main.purposes',
'missing.features.main.ide',
'nps.main.ide',
'python.years',
'python2.version.most',
'python3.version.most',
'several.projects',
'team.size',
'use.python.most',
'years.of.coding'
]
jb2 = (jb
[cols]
.rename(columns=lambda c: c.replace('.', '_'))
.assign(
age=lambda df_: df_.age.str.slice(0,2).astype(float).astype('Int64'),
are_you_datascientist=lambda df_: df_.are_you_datascientist.replace({'Yes': True, 'No': False, np.nan: False}),
company_size=lambda df_: df_.company_size.replace({
'Just me': 1,
'Not sure': np.nan,
'More than 5,000': 5000,
'2–10': 2,
'11–50':11,
'51–500': 51,
'501–1,000':501,
'1,001–5,000':1001
}).astype('Int64'),
country_live=lambda df_: df_.country_live.astype('category'),
employment_status=lambda df_: df_.employment_status.fillna('Other').astype('category'),
is_python_main=lambda df_: df_.is_python_main.astype('category'),
team_size=lambda df_: df_.team_size.str.split(r'-', n=1, expand=True).iloc[:,0].replace('More than 40 people', 41).where(df_.company_size!=1, 1).astype(float),
years_of_coding=lambda df_: df_.years_of_coding.replace('Less than 1 year', .5).str.extract(r'(\d+)').astype(float),
python_years=lambda df_: df_.python_years.replace('Less than 1 year', .5).str.extract(r'(\d+)').astype(float),
python3_ver=lambda df_: df_.python3_version_most.str.replace('_', '.').str.extract(r'(\d\.\d)').astype(float),
use_python_most=lambda df_: df_.use_python_most.fillna('Unknown')
).drop(columns=['python2_version_most'])
)
In this series of transformations we are using the assign
method, which lets us update or create columns of a dataframe through named arguments. When the new columns are described with functions, assign
evaluates each one against the state of the dataframe at that point, so we can refer to columns updated in earlier arguments instead of chaining several assign
calls.
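As a minimal sketch of this behaviour with toy data (not part of the survey), note how the second lambda can already see the column created by the first one within the same assign call:
toy = pd.DataFrame({"price": [10.0, 20.0, 30.0]})
toy2 = toy.assign(
    tax=lambda df_: df_.price * 0.21,       # new column derived from an existing one
    total=lambda df_: df_.price + df_.tax,  # 'tax', created just above, is already visible here
)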
jb2
age | are_you_datascientist | company_size | country_live | employment_status | first_learn_about_main_ide | how_often_use_main_ide | ide_main | is_python_main | job_team | main_purposes | missing_features_main_ide | nps_main_ide | python_years | python3_version_most | several_projects | team_size | use_python_most | years_of_coding | python3_ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 30 | False | 1 | NaN | Partially employed by a company / organization | Conference / User Group | Weekly | PyCharm Community Edition | Yes | Work as an external consultant or trainer | For work | No, it has all the features I need | 3.0 | 3.0 | Python 3_7 | Yes, I work on many different projects | 1.0 | Unknown | 1.0 | 3.7 |
1 | 21 | True | 5000 | India | Fully employed by a company / organization | School / University | Daily | VS Code | Yes | Work in a team | Both for work and personal | No, it has all the features I need | 8.0 | 3.0 | Python 3_6 | Yes, I work on one main and several side projects | 2.0 | Software prototyping | 3.0 | 3.6 |
2 | 30 | False | 5000 | United States | Fully employed by a company / organization | Friend / Colleague | Daily | Vim | Yes | Work on your own project(s) independently | Both for work and personal | No, it has all the features I need | 10.0 | 3.0 | Python 3_6 | Yes, I work on one main and several side projects | NaN | DevOps / System administration / Writing autom... | 3.0 | 3.6 |
3 | <NA> | False | <NA> | NaN | Other | Friend / Colleague | Daily | PyCharm Professional Edition | Yes | NaN | Both for work and personal | Yes – Please list: | 10.0 | 11.0 | Python 3_8 | Yes, I work on many different projects | 1.0 | Web development | 11.0 | 3.8 |
4 | 21 | False | <NA> | Italy | Student | Search engines | Daily | VS Code | Yes | Work on your own project(s) independently | For personal, educational or side projects | No, it has all the features I need | 10.0 | 1.0 | Python 3_8 | Yes, I work on one main and several side projects | 1.0 | Web development | NaN | 3.8 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54457 | 21 | False | 2 | Russian Federation | Fully employed by a company / organization | School / University | Daily | Vim | Yes | Work on your own project(s) independently | Both for work and personal | No, it has all the features I need | 10.0 | 6.0 | Python 3_6 | Yes, I work on many different projects | NaN | Data analysis | 1.0 | 3.6 |
54458 | <NA> | False | <NA> | NaN | Other | NaN | NaN | NaN | Yes | NaN | Both for work and personal | NaN | NaN | 3.0 | Python 3_7 | NaN | 1.0 | Web development | 1.0 | 3.7 |
54459 | 21 | False | 1 | Russian Federation | Self-employed (a person earning income directl... | Friend / Colleague | Daily | PyCharm Professional Edition | Yes | Work in a team | Both for work and personal | No, it has all the features I need | 10.0 | 3.0 | Python 3_7 | Yes, I work on many different projects | 1.0 | Web development | 6.0 | 3.7 |
54460 | 30 | True | 51 | Spain | Fully employed by a company / organization | Search engines | Daily | Other | Yes | Work on your own project(s) independently | Both for work and personal | Yes – Please list: | 3.0 | 6.0 | Python 3_7 | Yes, I work on many different projects | NaN | Data analysis | 3.0 | 3.7 |
54461 | 21 | False | 11 | Algeria | Fully employed by a company / organization | Online learning platform / Online course | Daily | VS Code | Yes | Work in a team | Both for work and personal | No, it has all the features I need | 10.0 | 1.0 | Python 3_8 | Yes, I work on many different projects | 2.0 | Unknown | 1.0 | 3.8 |
54462 rows × 20 columns
One way to perform aggregations in pandas is through the DataFrameGroupBy
object, usually followed by a selection of one or more columns and one or more aggregation functions.
gb = jb2.groupby(["country_live"])
type(gb)
pandas.core.groupby.generic.DataFrameGroupBy
gb["age"]
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fab4deee2b0>
We can apply the aggregation methods we have already seen for numpy and pandas to these GroupBy objects:
gb["age"].mean()
country_live
Algeria 27.528736
Argentina 29.65641
Armenia 25.967742
Australia 32.527578
Austria 29.914894
...
United States 32.023162
Uruguay 29.30303
Uzbekistan 23.290323
Venezuela 27.909091
Viet Nam 24.421053
Name: age, Length: 76, dtype: Float64
We can also group by several columns:
s = (
jb2
.groupby(["country_live", "employment_status"])
.age
.mean()
)
s
country_live employment_status
Algeria Freelancer (a person pursuing a profession without a long-term commitment to any one employer) 33.285714
Fully employed by a company / organization 28.344828
Other 43.333333
Partially employed by a company / organization 25.181818
Retired 60.0
...
Viet Nam Partially employed by a company / organization 23.4
Retired <NA>
Self-employed (a person earning income directly from one's own business, trade, or profession) 35.5
Student 19.384615
Working student 24.571429
Name: age, Length: 608, dtype: Float64
To pivot the last level of the index into columns we use the unstack method:
s.unstack()
employment_status | Freelancer (a person pursuing a profession without a long-term commitment to any one employer) | Fully employed by a company / organization | Other | Partially employed by a company / organization | Retired | Self-employed (a person earning income directly from one's own business, trade, or profession) | Student | Working student |
---|---|---|---|---|---|---|---|---|
country_live | ||||||||
Algeria | 33.285714 | 28.344828 | 43.333333 | 25.181818 | 60.0 | 28.411765 | 20.4375 | 21.0 |
Argentina | 34.954545 | 30.209524 | 40.0 | 30.25 | 55.0 | 29.571429 | 22.913043 | 23.2 |
Armenia | 30.0 | 23.842105 | 60.0 | 24.0 | 40.0 | 24.0 | <NA> | <NA> |
Australia | 30.818182 | 33.074468 | 33.1 | 32.0 | 48.285714 | 39.21875 | 24.828571 | 22.470588 |
Austria | 35.857143 | 31.318182 | 30.0 | 29.0 | 60.0 | 34.615385 | 20.842105 | 23.8125 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
United States | 34.318471 | 32.343223 | 35.534884 | 25.894737 | 54.920635 | 38.446512 | 21.947791 | 22.89375 |
Uruguay | 30.0 | 28.909091 | <NA> | 36.2 | <NA> | 30.0 | 20.0 | <NA> |
Uzbekistan | 36.0 | 22.125 | 29.0 | 21.0 | <NA> | 30.5 | 19.0 | 21.0 |
Venezuela | 29.888889 | 27.826087 | 30.5 | 26.8 | 55.0 | 28.75 | 20.454545 | 28.833333 |
Viet Nam | 28.8 | 23.025641 | 60.0 | 23.4 | <NA> | 35.5 | 19.384615 | 24.571429 |
76 rows × 8 columns
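To see what unstack does in isolation, here is a small toy example with a two-level index, where the inner level becomes the columns:
toy_s = pd.Series(
    [1, 2, 3, 4],
    index=pd.MultiIndex.from_product([["a", "b"], ["x", "y"]], names=["outer", "inner"]),
)
toy_s.unstack()  # 'inner' (x, y) becomes the columns; 'outer' (a, b) stays as the index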
We can also use an aggregation function defined by ourselves, via the agg method:
def per_emacs(s):
    # Percentage of respondents in the group whose main IDE mentions "Emacs"
    return s.str.contains("Emacs").sum() / len(s) * 100
(
jb2
.groupby("country_live")
.ide_main
.agg(per_emacs)
)
country_live
Algeria 0.000000
Argentina 4.102564
Armenia 0.000000
Australia 2.877698
Austria 1.595745
...
United States 4.025157
Uruguay 0.000000
Uzbekistan 0.000000
Venezuela 0.000000
Viet Nam 0.000000
Name: ide_main, Length: 76, dtype: float64
We can perform several aggregations at once:
(
jb2
.groupby("country_live")
.age
.agg([min, max])
)
min | max | |
---|---|---|
country_live | ||
Algeria | 18 | 60 |
Argentina | 18 | 60 |
Armenia | 18 | 60 |
Australia | 18 | 60 |
Austria | 18 | 60 |
... | ... | ... |
United States | 18 | 60 |
Uruguay | 18 | 60 |
Uzbekistan | 18 | 60 |
Venezuela | 18 | 60 |
Viet Nam | 18 | 60 |
76 rows × 2 columns
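Depending on the pandas version, passing the Python built-ins min and max to agg may emit a FutureWarning; the equivalent string names behave the same way:
jb2.groupby("country_live").age.agg(["min", "max"])  # same result, using string names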
Finally, we can also apply different aggregations to each column:
(
jb2
.groupby("country_live")
.agg({
"age": ["min", "max"],
"team_size": "mean"
})
)
age | team_size | ||
---|---|---|---|
min | max | mean | |
country_live | |||
Algeria | 18 | 60 | 2.428571 |
Argentina | 18 | 60 | 3.192053 |
Armenia | 18 | 60 | 6.076923 |
Australia | 18 | 60 | 2.710884 |
Austria | 18 | 60 | 2.448000 |
... | ... | ... | ... |
United States | 18 | 60 | 3.391337 |
Uruguay | 18 | 60 | 4.692308 |
Uzbekistan | 18 | 60 | 1.160000 |
Venezuela | 18 | 60 | 1.812500 |
Viet Nam | 18 | 60 | 3.285714 |
76 rows × 3 columns
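An alternative syntax for this kind of result is named aggregation (available in reasonably recent pandas versions), which also lets us pick the names of the output columns; a sketch:
(
    jb2
    .groupby("country_live")
    .agg(
        age_min=("age", "min"),
        age_max=("age", "max"),
        team_size_mean=("team_size", "mean"),
    )
)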
And the final combo, grouping by several columns and aggregating each column differently:
(
jb2
.groupby(["country_live", "employment_status"])
.agg({
"age": ["min", "max"],
"team_size": "mean"
})
)
age | team_size | |||
---|---|---|---|---|
min | max | mean | ||
country_live | employment_status | |||
Algeria | Freelancer (a person pursuing a profession without a long-term commitment to any one employer) | 21 | 60 | 1.000000 |
Fully employed by a company / organization | 18 | 60 | 2.750000 | |
Other | 30 | 50 | 1.000000 | |
Partially employed by a company / organization | 18 | 40 | 5.833333 | |
Retired | 60 | 60 | 1.000000 | |
... | ... | ... | ... | ... |
Viet Nam | Partially employed by a company / organization | 18 | 30 | 5.666667 |
Retired | <NA> | <NA> | NaN | |
Self-employed (a person earning income directly from one's own business, trade, or profession) | 21 | 60 | 1.000000 | |
Student | 18 | 21 | 1.000000 | |
Working student | 18 | 40 | 1.333333 |
608 rows × 3 columns
Merging#
So far we have treated dataframes independently; let us now see how to bring several dataframes together in different ways.
df1 = pd.DataFrame(
{
'name': ['John', 'George', 'Ringo'],
'color': ['Blue', 'Blue', 'Purple']
}
)
df2 = pd.DataFrame(
{
'name': ['Paul', 'George', 'Ringo'],
'carcolor': ['Red', 'Blue', np.nan]
},
index=[3, 1, 2]
)
df1
name | color | |
---|---|---|
0 | John | Blue |
1 | George | Blue |
2 | Ringo | Purple |
df2
name | carcolor | |
---|---|---|
3 | Paul | Red |
1 | George | Blue |
2 | Ringo | NaN |
The merge
method performs an inner join by default:
df1.merge(df2)
name | color | carcolor | |
---|---|---|---|
0 | George | Blue | Blue |
1 | Ringo | Purple | NaN |
But there are other types of joins, commonly used when working with databases:
df1.merge(df2, how="outer")
name | color | carcolor | |
---|---|---|---|
0 | John | Blue | NaN |
1 | George | Blue | Blue |
2 | Ringo | Purple | NaN |
3 | Paul | NaN | Red |
df1.merge(df2, how="left")
name | color | carcolor | |
---|---|---|---|
0 | John | Blue | NaN |
1 | George | Blue | Blue |
2 | Ringo | Purple | NaN |
By default, pandas joins on the columns that share the same name in both dataframes. Sometimes, however, we need to specify the column names explicitly; for that we have the on
, left_on
and right_on
parameters:
df1.merge(df2, how="right", left_on="color", right_on="carcolor")
name_x | color | name_y | carcolor | |
---|---|---|---|---|
0 | NaN | NaN | Paul | Red |
1 | John | Blue | George | Blue |
2 | George | Blue | George | Blue |
3 | NaN | NaN | Ringo | NaN |
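Note the name_x and name_y columns in the result: when both dataframes contain a non-key column with the same name, pandas disambiguates them with the suffixes _x and _y, which can be customised through the suffixes parameter:
df1.merge(df2, how="right", left_on="color", right_on="carcolor", suffixes=("_left", "_right"))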
Pandas also provides the join
method, which is similar to merge
but operates on the dataframes' indexes rather than on their columns.
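A minimal sketch with the two dataframes above (the rsuffix argument is needed here because both dataframes have a name column):
df1.join(df2, rsuffix="_df2")  # left join by default, aligning df1's index (0, 1, 2) with df2's (3, 1, 2)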