Aggregation and merging of dataframes#
In this section we introduce two types of operations that are very common when working with dataframes and that are inspired by SQL: aggregations with groupby
and merges with merge
. The former lets us apply aggregation functions to subsets of our dataframe; the latter lets several dataframes interact when they share one or more keys.
Aggregations#
We load a dataset from a survey of Python programmers:
import pandas as pd
import numpy as np
url = "https://github.com/mattharrison/datasets/raw/master/data/2020-jetbrains-python-survey.csv"
jb = pd.read_csv(url)
/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py:3326: DtypeWarning: Columns (152) have mixed types.Specify dtype option on import or set low_memory=False.
exec(code_obj, self.user_global_ns, self.user_ns)
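The warning comes from read_csv inferring column types chunk by chunk; as the message itself suggests, it can be silenced by passing low_memory=False (or an explicit dtype). A minimal sketch:
jb = pd.read_csv(url, low_memory=False)  # read in a single pass; avoids the DtypeWarning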
We select some of the columns and apply a pipeline of transformations:
cols = [
'age',
'are.you.datascientist',
'company.size',
'country.live',
'employment.status',
'first.learn.about.main.ide',
'how.often.use.main.ide',
'ide.main',
'is.python.main',
'job.team',
'main.purposes',
'missing.features.main.ide',
'nps.main.ide',
'python.years',
'python2.version.most',
'python3.version.most',
'several.projects',
'team.size',
'use.python.most',
'years.of.coding'
]
jb2 = (jb
[cols]
.rename(columns=lambda c: c.replace('.', '_'))
.assign(
age=lambda df_: df_.age.str.slice(0,2).astype(float).astype('Int64'),
are_you_datascientist=lambda df_: df_.are_you_datascientist.replace({'Yes': True, 'No': False, np.nan: False}),
company_size=lambda df_: df_.company_size.replace({
'Just me': 1,
'Not sure': np.nan,
'More than 5,000': 5000,
'2–10': 2,
'11–50':11,
'51–500': 51,
'501–1,000':501,
'1,001–5,000':1001
}).astype('Int64'),
country_live=lambda df_: df_.country_live.astype('category'),
employment_status=lambda df_: df_.employment_status.fillna('Other').astype('category'),
is_python_main=lambda df_: df_.is_python_main.astype('category'),
team_size=lambda df_: df_.team_size.str.split(r'-', n=1, expand=True).iloc[:,0].replace('More than 40 people', 41).where(df_.company_size!=1, 1).astype(float),
years_of_coding=lambda df_: df_.years_of_coding.replace('Less than 1 year', .5).str.extract(r'(\d+)').astype(float),
python_years=lambda df_: df_.python_years.replace('Less than 1 year', .5).str.extract(r'(\d+)').astype(float),
python3_ver=lambda df_: df_.python3_version_most.str.replace('_', '.').str.extract(r'(\d\.\d)').astype(float),
use_python_most=lambda df_: df_.use_python_most.fillna('Unknown')
).drop(columns=['python2_version_most'])
)
In this series of transformations we are using the assign
method, which lets us update or create columns of a dataframe through named arguments. When the new columns are described with functions, assign
evaluates each one against the state of the dataframe at that point, so we can refer to columns updated in earlier arguments instead of chaining several assign
calls.
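As a minimal sketch of this behaviour with toy data (not part of the survey), note how the second lambda can already see the column created by the first one within the same assign call:
toy = pd.DataFrame({"price": [10.0, 20.0, 30.0]})
toy2 = toy.assign(
    tax=lambda df_: df_.price * 0.21,       # new column derived from an existing one
    total=lambda df_: df_.price + df_.tax,  # 'tax', created just above, is already visible here
)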
jb2
age | are_you_datascientist | company_size | country_live | employment_status | first_learn_about_main_ide | how_often_use_main_ide | ide_main | is_python_main | job_team | main_purposes | missing_features_main_ide | nps_main_ide | python_years | python3_version_most | several_projects | team_size | use_python_most | years_of_coding | python3_ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 30 | False | 1 | NaN | Partially employed by a company / organization | Conference / User Group | Weekly | PyCharm Community Edition | Yes | Work as an external consultant or trainer | For work | No, it has all the features I need | 3.0 | 3.0 | Python 3_7 | Yes, I work on many different projects | 1.0 | Unknown | 1.0 | 3.7 |
1 | 21 | True | 5000 | India | Fully employed by a company / organization | School / University | Daily | VS Code | Yes | Work in a team | Both for work and personal | No, it has all the features I need | 8.0 | 3.0 | Python 3_6 | Yes, I work on one main and several side projects | 2.0 | Software prototyping | 3.0 | 3.6 |
2 | 30 | False | 5000 | United States | Fully employed by a company / organization | Friend / Colleague | Daily | Vim | Yes | Work on your own project(s) independently | Both for work and personal | No, it has all the features I need | 10.0 | 3.0 | Python 3_6 | Yes, I work on one main and several side projects | NaN | DevOps / System administration / Writing autom... | 3.0 | 3.6 |
3 | <NA> | False | <NA> | NaN | Other | Friend / Colleague | Daily | PyCharm Professional Edition | Yes | NaN | Both for work and personal | Yes – Please list: | 10.0 | 11.0 | Python 3_8 | Yes, I work on many different projects | 1.0 | Web development | 11.0 | 3.8 |
4 | 21 | False | <NA> | Italy | Student | Search engines | Daily | VS Code | Yes | Work on your own project(s) independently | For personal, educational or side projects | No, it has all the features I need | 10.0 | 1.0 | Python 3_8 | Yes, I work on one main and several side projects | 1.0 | Web development | NaN | 3.8 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54457 | 21 | False | 2 | Russian Federation | Fully employed by a company / organization | School / University | Daily | Vim | Yes | Work on your own project(s) independently | Both for work and personal | No, it has all the features I need | 10.0 | 6.0 | Python 3_6 | Yes, I work on many different projects | NaN | Data analysis | 1.0 | 3.6 |
54458 | <NA> | False | <NA> | NaN | Other | NaN | NaN | NaN | Yes | NaN | Both for work and personal | NaN | NaN | 3.0 | Python 3_7 | NaN | 1.0 | Web development | 1.0 | 3.7 |
54459 | 21 | False | 1 | Russian Federation | Self-employed (a person earning income directl... | Friend / Colleague | Daily | PyCharm Professional Edition | Yes | Work in a team | Both for work and personal | No, it has all the features I need | 10.0 | 3.0 | Python 3_7 | Yes, I work on many different projects | 1.0 | Web development | 6.0 | 3.7 |
54460 | 30 | True | 51 | Spain | Fully employed by a company / organization | Search engines | Daily | Other | Yes | Work on your own project(s) independently | Both for work and personal | Yes – Please list: | 3.0 | 6.0 | Python 3_7 | Yes, I work on many different projects | NaN | Data analysis | 3.0 | 3.7 |
54461 | 21 | False | 11 | Algeria | Fully employed by a company / organization | Online learning platform / Online course | Daily | VS Code | Yes | Work in a team | Both for work and personal | No, it has all the features I need | 10.0 | 1.0 | Python 3_8 | Yes, I work on many different projects | 2.0 | Unknown | 1.0 | 3.8 |
54462 rows × 20 columns
One way to perform aggregations in pandas is through the DataFrameGroupBy
object, usually followed by a selection of one or more columns and one or more aggregation functions.
gb = jb2.groupby(["country_live"])
type(gb)
pandas.core.groupby.generic.DataFrameGroupBy
gb["age"]
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fab4deee2b0>
We can apply the aggregation methods we have already seen for numpy and pandas to these GroupBy objects:
gb["age"].mean()
country_live
Algeria 27.528736
Argentina 29.65641
Armenia 25.967742
Australia 32.527578
Austria 29.914894
...
United States 32.023162
Uruguay 29.30303
Uzbekistan 23.290323
Venezuela 27.909091
Viet Nam 24.421053
Name: age, Length: 76, dtype: Float64
We can also group by several columns:
s = (
jb2
.groupby(["country_live", "employment_status"])
.age
.mean()
)
s
country_live employment_status
Algeria Freelancer (a person pursuing a profession without a long-term commitment to any one employer) 33.285714
Fully employed by a company / organization 28.344828
Other 43.333333
Partially employed by a company / organization 25.181818
Retired 60.0
...
Viet Nam Partially employed by a company / organization 23.4
Retired <NA>
Self-employed (a person earning income directly from one's own business, trade, or profession) 35.5
Student 19.384615
Working student 24.571429
Name: age, Length: 608, dtype: Float64
To pivot the last level of the index into columns we use the unstack method:
s.unstack()
employment_status | Freelancer (a person pursuing a profession without a long-term commitment to any one employer) | Fully employed by a company / organization | Other | Partially employed by a company / organization | Retired | Self-employed (a person earning income directly from one's own business, trade, or profession) | Student | Working student |
---|---|---|---|---|---|---|---|---|
country_live | ||||||||
Algeria | 33.285714 | 28.344828 | 43.333333 | 25.181818 | 60.0 | 28.411765 | 20.4375 | 21.0 |
Argentina | 34.954545 | 30.209524 | 40.0 | 30.25 | 55.0 | 29.571429 | 22.913043 | 23.2 |
Armenia | 30.0 | 23.842105 | 60.0 | 24.0 | 40.0 | 24.0 | <NA> | <NA> |
Australia | 30.818182 | 33.074468 | 33.1 | 32.0 | 48.285714 | 39.21875 | 24.828571 | 22.470588 |
Austria | 35.857143 | 31.318182 | 30.0 | 29.0 | 60.0 | 34.615385 | 20.842105 | 23.8125 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
United States | 34.318471 | 32.343223 | 35.534884 | 25.894737 | 54.920635 | 38.446512 | 21.947791 | 22.89375 |
Uruguay | 30.0 | 28.909091 | <NA> | 36.2 | <NA> | 30.0 | 20.0 | <NA> |
Uzbekistan | 36.0 | 22.125 | 29.0 | 21.0 | <NA> | 30.5 | 19.0 | 21.0 |
Venezuela | 29.888889 | 27.826087 | 30.5 | 26.8 | 55.0 | 28.75 | 20.454545 | 28.833333 |
Viet Nam | 28.8 | 23.025641 | 60.0 | 23.4 | <NA> | 35.5 | 19.384615 | 24.571429 |
76 rows × 8 columns
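To see what unstack does in isolation, here is a small toy example with a two-level index, where the inner level becomes the columns:
toy_s = pd.Series(
    [1, 2, 3, 4],
    index=pd.MultiIndex.from_product([["a", "b"], ["x", "y"]], names=["outer", "inner"]),
)
toy_s.unstack()  # 'inner' (x, y) becomes the columns; 'outer' (a, b) stays as the index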
We can also use an aggregation function defined by ourselves, via the agg method:
def per_emacs(s):
    # Percentage of respondents in the group whose main IDE mentions "Emacs"
    return s.str.contains("Emacs").sum() / len(s) * 100
(
jb2
.groupby("country_live")
.ide_main
.agg(per_emacs)
)
country_live
Algeria 0.000000
Argentina 4.102564
Armenia 0.000000
Australia 2.877698
Austria 1.595745
...
United States 4.025157
Uruguay 0.000000
Uzbekistan 0.000000
Venezuela 0.000000
Viet Nam 0.000000
Name: ide_main, Length: 76, dtype: float64
We can perform several aggregations at once:
(
jb2
.groupby("country_live")
.age
.agg([min, max])
)
min | max | |
---|---|---|
country_live | ||
Algeria | 18 | 60 |
Argentina | 18 | 60 |
Armenia | 18 | 60 |
Australia | 18 | 60 |
Austria | 18 | 60 |
... | ... | ... |
United States | 18 | 60 |
Uruguay | 18 | 60 |
Uzbekistan | 18 | 60 |
Venezuela | 18 | 60 |
Viet Nam | 18 | 60 |
76 rows × 2 columns
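Depending on the pandas version, passing the Python built-ins min and max to agg may emit a FutureWarning; the equivalent string names behave the same way:
jb2.groupby("country_live").age.agg(["min", "max"])  # same result, using string names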
Finally, we can also apply different aggregations to each column:
(
jb2
.groupby("country_live")
.agg({
"age": ["min", "max"],
"team_size": "mean"
})
)
age | team_size | ||
---|---|---|---|
min | max | mean | |
country_live | |||
Algeria | 18 | 60 | 2.428571 |
Argentina | 18 | 60 | 3.192053 |
Armenia | 18 | 60 | 6.076923 |
Australia | 18 | 60 | 2.710884 |
Austria | 18 | 60 | 2.448000 |
... | ... | ... | ... |
United States | 18 | 60 | 3.391337 |
Uruguay | 18 | 60 | 4.692308 |
Uzbekistan | 18 | 60 | 1.160000 |
Venezuela | 18 | 60 | 1.812500 |
Viet Nam | 18 | 60 | 3.285714 |
76 rows × 3 columns
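An alternative syntax for this kind of result is named aggregation (available in reasonably recent pandas versions), which also lets us pick the names of the output columns; a sketch:
(
    jb2
    .groupby("country_live")
    .agg(
        age_min=("age", "min"),
        age_max=("age", "max"),
        team_size_mean=("team_size", "mean"),
    )
)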
And the final combo, grouping by several columns and aggregating each column differently:
(
jb2
.groupby(["country_live", "employment_status"])
.agg({
"age": ["min", "max"],
"team_size": "mean"
})
)
age | team_size | |||
---|---|---|---|---|
min | max | mean | ||
country_live | employment_status | |||
Algeria | Freelancer (a person pursuing a profession without a long-term commitment to any one employer) | 21 | 60 | 1.000000 |
Fully employed by a company / organization | 18 | 60 | 2.750000 | |
Other | 30 | 50 | 1.000000 | |
Partially employed by a company / organization | 18 | 40 | 5.833333 | |
Retired | 60 | 60 | 1.000000 | |
... | ... | ... | ... | ... |
Viet Nam | Partially employed by a company / organization | 18 | 30 | 5.666667 |
Retired | <NA> | <NA> | NaN | |
Self-employed (a person earning income directly from one's own business, trade, or profession) | 21 | 60 | 1.000000 | |
Student | 18 | 21 | 1.000000 | |
Working student | 18 | 40 | 1.333333 |
608 rows × 3 columns
Merging#
So far we have treated dataframes independently; let us now see how to bring several dataframes together in different ways.
df1 = pd.DataFrame(
{
'name': ['John', 'George', 'Ringo'],
'color': ['Blue', 'Blue', 'Purple']
}
)
df2 = pd.DataFrame(
{
'name': ['Paul', 'George', 'Ringo'],
'carcolor': ['Red', 'Blue', np.nan]
},
index=[3, 1, 2]
)
df1
name | color | |
---|---|---|
0 | John | Blue |
1 | George | Blue |
2 | Ringo | Purple |
df2
name | carcolor | |
---|---|---|
3 | Paul | Red |
1 | George | Blue |
2 | Ringo | NaN |
The merge
method performs an inner join by default:
df1.merge(df2)
name | color | carcolor | |
---|---|---|---|
0 | George | Blue | Blue |
1 | Ringo | Purple | NaN |
But there are other types of joins, commonly used when working with databases:
df1.merge(df2, how="outer")
name | color | carcolor | |
---|---|---|---|
0 | John | Blue | NaN |
1 | George | Blue | Blue |
2 | Ringo | Purple | NaN |
3 | Paul | NaN | Red |
df1.merge(df2, how="left")
name | color | carcolor | |
---|---|---|---|
0 | John | Blue | NaN |
1 | George | Blue | Blue |
2 | Ringo | Purple | NaN |
By default, pandas joins on the columns that share the same name in both dataframes. Sometimes, however, we need to specify the column names explicitly; for that we have the on
, left_on
and right_on
parameters:
df1.merge(df2, how="right", left_on="color", right_on="carcolor")
name_x | color | name_y | carcolor | |
---|---|---|---|---|
0 | NaN | NaN | Paul | Red |
1 | John | Blue | George | Blue |
2 | George | Blue | George | Blue |
3 | NaN | NaN | Ringo | NaN |
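Note the name_x and name_y columns in the result: when both dataframes contain a non-key column with the same name, pandas disambiguates them with the suffixes _x and _y, which can be customised through the suffixes parameter:
df1.merge(df2, how="right", left_on="color", right_on="carcolor", suffixes=("_left", "_right"))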
Pandas also provides the join
method, which is similar to merge
but operates on the dataframes' indexes rather than on their columns.
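A minimal sketch with the two dataframes above (the rsuffix argument is needed here because both dataframes have a name column):
df1.join(df2, rsuffix="_df2")  # left join by default, aligning df1's index (0, 1, 2) with df2's (3, 1, 2)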