Manejo de Datos Faltantes: Imputación#

Configuración de ambiente de trabajo#

pip install -r requirements.txt
^C
Requirement already satisfied: pip in c:\users\admin\miniconda3\lib\site-packages (from -r requirements.txt (line 1)) (23.1.2)
Collecting pyjanitor (from -r requirements.txt (line 2))
  Using cached pyjanitor-0.25.0-py3-none-any.whl (171 kB)
Requirement already satisfied: matplotlib in c:\users\admin\miniconda3\lib\site-packages (from -r requirements.txt (line 3)) (3.7.2)
Collecting missingno (from -r requirements.txt (line 4))
  Using cached missingno-0.5.2-py3-none-any.whl (8.7 kB)
Requirement already satisfied: numpy in c:\users\admin\miniconda3\lib\site-packages (from -r requirements.txt (line 5)) (1.25.2)
Requirement already satisfied: pandas in c:\users\admin\miniconda3\lib\site-packages (from -r requirements.txt (line 6)) (2.1.0)
Requirement already satisfied: pandas-flavor in c:\users\admin\miniconda3\lib\site-packages (from -r requirements.txt (line 7)) (0.6.0)
Collecting pandas-profiling (from -r requirements.txt (line 8))
  Using cached pandas_profiling-3.2.0-py2.py3-none-any.whl (262 kB)
Requirement already satisfied: seaborn in c:\users\admin\miniconda3\lib\site-packages (from -r requirements.txt (line 9)) (0.12.2)
Collecting session-info (from -r requirements.txt (line 10))
  Using cached session_info-1.0.0.tar.gz (24 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting ipywidgets (from -r requirements.txt (line 11))
  Using cached ipywidgets-8.1.0-py3-none-any.whl (139 kB)
Collecting ozon3 (from -r requirements.txt (line 12))
  Using cached ozon3-4.0.2-py3-none-any.whl (27 kB)
Collecting pyreadr (from -r requirements.txt (line 13))
  Using cached pyreadr-0.4.9-cp311-cp311-win_amd64.whl (1.3 MB)
Collecting upsetplot (from -r requirements.txt (line 14))
  Using cached UpSetPlot-0.8.0.tar.gz (22 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting statsmodels==0.13.2 (from -r requirements.txt (line 15))
  Using cached statsmodels-0.13.2.tar.gz (17.9 MB)
  Installing build dependencies: started
Note: you may need to restart the kernel to use updated packages.

Importar librerías#

import janitor
import matplotlib.pyplot as plt
import missingno
import nhanes.load
import numpy as np
import pandas as pd
import scipy.stats
import seaborn as sns
import session_info
import sklearn.compose
import sklearn.impute
import sklearn.preprocessing
import statsmodels.api as sm
import statsmodels.datasets
import statsmodels.formula.api as smf

from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import BayesianRidge, Ridge
from sklearn.neighbors import KNeighborsRegressor
from statsmodels.graphics.mosaicplot import mosaic

Importar funciones personalizadas#

%run pandas-missing-extension.ipynb
/root/venv/lib/python3.9/site-packages/upsetplot/plotting.py:20: MatplotlibDeprecationWarning: The matplotlib.tight_layout module was deprecated in Matplotlib 3.6 and will be removed two minor releases later.
  from matplotlib.tight_layout import get_renderer

Configurar el aspecto general de las gráficas del proyecto#

%matplotlib inline

sns.set(
    rc={
        "figure.figsize": (8, 6)
    }
)

sns.set_style("whitegrid")

El problema de trabajar con valores faltantes#

airquality_df = (
    sm.datasets.get_rdataset("airquality")
    .data
    .clean_names(
        case_type = "snake"
    )
    .add_column("year", 1973)
    .assign(
        date = lambda df: pd.to_datetime(df[["year", "month", "day"]])
    )
    .sort_values(by = "date")
    .set_index("date")
)

airquality_df
ozone solar_r wind temp month day year
date
1973-05-01 41.0 190.0 7.4 67 5 1 1973
1973-05-02 36.0 118.0 8.0 72 5 2 1973
1973-05-03 12.0 149.0 12.6 74 5 3 1973
1973-05-04 18.0 313.0 11.5 62 5 4 1973
1973-05-05 NaN NaN 14.3 56 5 5 1973
... ... ... ... ... ... ... ...
1973-09-26 30.0 193.0 6.9 70 9 26 1973
1973-09-27 NaN 145.0 13.2 77 9 27 1973
1973-09-28 14.0 191.0 14.3 75 9 28 1973
1973-09-29 18.0 131.0 8.0 76 9 29 1973
1973-09-30 20.0 223.0 11.5 68 9 30 1973

153 rows × 7 columns

Hagamos dos modelos de regresion para predecir la temperatura en funcion de otras variables:

(
    smf.ols(
        formula = "temp ~ ozone",
        data = airquality_df
    )
    .fit()
    .summary()
    .tables[0]
)
OLS Regression Results
Dep. Variable: temp R-squared: 0.488
Model: OLS Adj. R-squared: 0.483
Method: Least Squares F-statistic: 108.5
Date: Wed, 25 Jan 2023 Prob (F-statistic): 2.93e-18
Time: 03:02:38 Log-Likelihood: -386.27
No. Observations: 116 AIC: 776.5
Df Residuals: 114 BIC: 782.1
Df Model: 1
Covariance Type: nonrobust
(
    smf.ols(
        formula = "temp ~ ozone + solar_r",
        data = airquality_df
    )
    .fit()
    .summary()
    .tables[0]
)
OLS Regression Results
Dep. Variable: temp R-squared: 0.491
Model: OLS Adj. R-squared: 0.481
Method: Least Squares F-statistic: 52.07
Date: Wed, 25 Jan 2023 Prob (F-statistic): 1.47e-16
Time: 03:02:38 Log-Likelihood: -369.78
No. Observations: 111 AIC: 745.6
Df Residuals: 108 BIC: 753.7
Df Model: 2
Covariance Type: nonrobust

Si analizas el numero de observaciones, veras que un modelo ignora cierta cantidad de valores faltantes mientras que el otro no lo hace. No es recomendable en este caso usar el R^2 para comparar ambos modelos como equivalentes. Por lo tanto toca imputar datos y asi poder realizarlo

Preparando datos: National Health and Nutrition Examination Survey#

%run download-data-and-load-it.ipynb
/root/venv/lib/python3.9/site-packages/missingno/missingno.py:73: MatplotlibDeprecationWarning: The 'b' parameter of grid() has been renamed 'visible' since Matplotlib 3.5; support for the old name will be dropped two minor releases later.
  ax0.grid(b=False)
/root/venv/lib/python3.9/site-packages/missingno/missingno.py:142: MatplotlibDeprecationWarning: The 'b' parameter of grid() has been renamed 'visible' since Matplotlib 3.5; support for the old name will be dropped two minor releases later.
  ax1.grid(b=False)
../_images/0a82d7684ac2acda1aee5efc7852176c33d696a6a7ceeefc4aa1e7abc845c143.png ../_images/245ef82448e0916d151b1313f963eb9bc0d60119e58a9d1ef99463d539215aaf.png
/root/venv/lib/python3.9/site-packages/missingno/missingno.py:73: MatplotlibDeprecationWarning: The 'b' parameter of grid() has been renamed 'visible' since Matplotlib 3.5; support for the old name will be dropped two minor releases later.
  ax0.grid(b=False)
/root/venv/lib/python3.9/site-packages/missingno/missingno.py:142: MatplotlibDeprecationWarning: The 'b' parameter of grid() has been renamed 'visible' since Matplotlib 3.5; support for the old name will be dropped two minor releases later.
  ax1.grid(b=False)
../_images/84302e5baba86e865c03e5998b227de0061677a68182eda3b4d11f2570234303.png

Consideración y evaluación de los distintos tipos de valores faltantes#

Evaluación del mecanismo de valores faltantes por prueba de t-test#

📘 Información

two-sided: las medias de las distribuciones subyacentes a las muestras son desiguales.
less: la media de la distribución subyacente a la primera muestra es menor que la media de la distribución subyacente a la segunda muestra.
greater: la media de la distribución subyacente a la primera muestra es mayor que la media de la distribución subyacente a la segunda muestra.

female_weight, male_weight = (
    nhanes_df
    .select_columns("gender", "weight")
    .transform_column(
        "weight",
        lambda x: x.isna(),
        elementwise = False
    )
    .groupby("gender")
    .weight
    .pipe(
        lambda df: (
            df.get_group("Female"),
            df.get_group("Male")
        )
    )
)

# Ahora se hace una prueba estadistica para determinar si hay o no diferencia en la presencia
# o ausencia de valores de peso

scipy.stats.ttest_ind(
    a = female_weight,
    b = male_weight,
    alternative="two-sided"
)
Ttest_indResult(statistic=-0.3621032192538131, pvalue=0.7172855918077239)

Dado que pvalue < 0.05, no se rechaza la hipotesis nula.

No existe diferencia estadistica entre los valores faltantes

Los datos entonces no estan perdidos al azar entre hombres y mujeres

Tratamiento de variables categóricas para imputación de valores faltantes#

# Primero se realiza una copia del dataframe original
nhanes_transformed_df = nhanes_df.copy(deep=True)

Codificación ordinal#

📘 Información

Una codificación ordinal implica mapear cada etiqueta (categoría) única a un valor entero. A su vez, la codificación ordinal también es conocida como codificación entera.

Ejemplo#

Dado un conjunto de datos con dos características, encontraremos los valores únicos por cataracterística y los transformaremos utilizando una codificación ordinal.

encoder = sklearn.preprocessing.OrdinalEncoder()

X = [["Male"], ["Female"], ["Female"]]
X
[['Male'], ['Female'], ['Female']]
encoder.fit_transform(X)
array([[1.],
       [0.],
       [0.]])
encoder.categories_
[array(['Female', 'Male'], dtype=object)]
encoder.inverse_transform([[1], [0], [0]])
array([['Male'],
       ['Female'],
       ['Female']], dtype=object)

Aplicando la codificación ordinal a todas tus variables categóricas#

categorical_columns = nhanes_df.select_dtypes(include=[object, "category"]).columns
categorical_transformer = sklearn.compose.make_column_transformer(
    (sklearn.preprocessing.OrdinalEncoder(), categorical_columns),
    remainder="passthrough"
    #aqui le dices que no modifique las variables que no son categoricas
)
nhanes_transformed_df = (
    pd.DataFrame(
        categorical_transformer.fit_transform(nhanes_df),
        columns = categorical_transformer.get_feature_names_out(),
        index = nhanes_df.index
    )
    .rename_columns(
        function = lambda x: x.removeprefix("ordinalencoder__")
    )
    .rename_columns(
        function = lambda x: x.removeprefix("remainder__")
    )
)

nhanes_transformed_df
general_health_condition gender height weight total_cholesterol pulse diabetes age
SEQN
93705.0 2.0 0.0 63.0 165.0 157.0 52.0 0.0 66.0
93706.0 4.0 1.0 68.0 145.0 148.0 82.0 0.0 18.0
93707.0 2.0 1.0 NaN NaN 189.0 100.0 0.0 13.0
93709.0 NaN 0.0 62.0 200.0 176.0 74.0 0.0 75.0
93711.0 4.0 1.0 69.0 142.0 238.0 62.0 0.0 56.0
... ... ... ... ... ... ... ... ...
102949.0 0.0 1.0 72.0 180.0 201.0 96.0 0.0 33.0
102953.0 1.0 1.0 65.0 218.0 182.0 78.0 0.0 42.0
102954.0 2.0 0.0 66.0 150.0 172.0 78.0 0.0 41.0
102955.0 4.0 0.0 NaN NaN 150.0 74.0 0.0 14.0
102956.0 2.0 1.0 69.0 250.0 163.0 76.0 0.0 38.0

7157 rows × 8 columns

One Hot Encoding#

nhanes_transformed_df2 = nhanes_df.copy(deep=True)

pandas.get_dummies() vs skelearn.preprocessing.OneHotEncoder()#

pandas.get_dummies()#

pandas.get_dummies() tiene desventajas comparado con sklearn.preprocessing.onehotencoder(), estas son:

  • No sabe lidiar con valores nulos

  • Si no se seleccionan todas las columnas, puede que no se detecten todas las categorias

(
    nhanes_transformed_df2
    .select_columns("general_health_condition")
    .pipe(pd.get_dummies)
)
general_health_condition_Excellent general_health_condition_Fair or general_health_condition_Good general_health_condition_Poor? general_health_condition_Very good
SEQN
93705.0 0 0 1 0 0
93706.0 0 0 0 0 1
93707.0 0 0 1 0 0
93709.0 0 0 0 0 0
93711.0 0 0 0 0 1
... ... ... ... ... ...
102949.0 1 0 0 0 0
102953.0 0 1 0 0 0
102954.0 0 0 1 0 0
102955.0 0 0 0 0 1
102956.0 0 0 1 0 0

7157 rows × 5 columns

skelearn.preprocessing.OneHotEncoder()#

Esta opcion tiene la ventaja que va a detectar automaticamente todas las categorias

transformer = sklearn.compose.make_column_transformer(
    (sklearn.preprocessing.OrdinalEncoder(), ["gender"]),
    (sklearn.preprocessing.OneHotEncoder(), ["general_health_condition"]),
    remainder="passthrough"
)
nhanes_transformed_df2 = (
    pd.DataFrame(
        transformer.fit_transform(nhanes_df),
        columns = transformer.get_feature_names_out(),
        index = nhanes_df.index
    )
    .rename_columns(
        function = lambda x: x.removeprefix("ordinalencoder__")
    )
    .rename_columns(
        function = lambda x: x.removeprefix("remainder__")
    )
    .rename_columns(
        function = lambda x: x.removeprefix("onehotencoder__")
    )
)

nhanes_transformed_df2
gender general_health_condition_Excellent general_health_condition_Fair or general_health_condition_Good general_health_condition_Poor? general_health_condition_Very good general_health_condition_nan height weight total_cholesterol pulse diabetes age
SEQN
93705.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 63.0 165.0 157.0 52.0 0.0 66.0
93706.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 68.0 145.0 148.0 82.0 0.0 18.0
93707.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 NaN NaN 189.0 100.0 0.0 13.0
93709.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 62.0 200.0 176.0 74.0 0.0 75.0
93711.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 69.0 142.0 238.0 62.0 0.0 56.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
102949.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 72.0 180.0 201.0 96.0 0.0 33.0
102953.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 65.0 218.0 182.0 78.0 0.0 42.0
102954.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 66.0 150.0 172.0 78.0 0.0 41.0
102955.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 NaN NaN 150.0 74.0 0.0 14.0
102956.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 69.0 250.0 163.0 76.0 0.0 38.0

7157 rows × 13 columns

(
    transformer
    .named_transformers_
    .get("onehotencoder")
    .categories_
)
[array(['Excellent', 'Fair or', 'Good', 'Poor?', 'Very good', nan],
       dtype=object)]
(
    transformer
    .named_transformers_
    .get("onehotencoder")
    .inverse_transform(
        X = [[0, 0, 1, 0, 0, 0]]
    )
)
array([['Good']], dtype=object)

Tipos de imputación de valores faltantes#

Imputación de un único valor (media, mediana, moda)#

Media#

(
    nhanes_df
    # janitor
    .transform_column(
        "height",
        lambda x: x.fillna(x.mean()),
        elementwise=False
    )
)
height weight general_health_condition total_cholesterol pulse diabetes age gender
SEQN
93705.0 63.00000 165.0 Good 157.0 52.0 0 66.0 Female
93706.0 68.00000 145.0 Very good 148.0 82.0 0 18.0 Male
93707.0 66.25656 NaN Good 189.0 100.0 0 13.0 Male
93709.0 62.00000 200.0 NaN 176.0 74.0 0 75.0 Female
93711.0 69.00000 142.0 Very good 238.0 62.0 0 56.0 Male
... ... ... ... ... ... ... ... ...
102949.0 72.00000 180.0 Excellent 201.0 96.0 0 33.0 Male
102953.0 65.00000 218.0 Fair or 182.0 78.0 0 42.0 Male
102954.0 66.00000 150.0 Good 172.0 78.0 0 41.0 Female
102955.0 66.25656 NaN Very good 150.0 74.0 0 14.0 Female
102956.0 69.00000 250.0 Good 163.0 76.0 0 38.0 Male

7157 rows × 8 columns

(
    nhanes_df
    .select_columns("height", "weight")
    .missing.bind_shadow_matrix(True,False, suffix = "_imp")
    .assign(
        height = lambda df: df.height.fillna(value = df.height.mean()),
        weight = lambda df: df.weight.fillna(value = df.weight.mean())
    )
    .missing.scatter_imputation_plot(
        x = "height",
        y = "weight"
    )
)
<AxesSubplot: xlabel='height', ylabel='weight'>
../_images/1f40e564fd9eeb2eb0042250c419b94358498264bb9287c3dc0b4fa5c940bf9d.png
(
    nhanes_df
    .select_columns("height", "weight")
    .missing.bind_shadow_matrix(True,False, suffix = "_imp")
    .assign(
        height = lambda df: df.height.fillna(value = df.height.mean()),
        weight = lambda df: df.weight.fillna(value = df.weight.mean())
    )
    .missing.scatter_imputation_plot(
        x = "height",
        y = "weight",
        show_marginal= True,
        height = 10
    )
)
<seaborn.axisgrid.JointGrid at 0x7f756fdf28b0>
../_images/936946e4d2693fb21b52243620b2f11f4bd700222de25a61ffd96d6f65c2f5ed.png

Imputación por llenado hacia atrás e imputación por llenado hacia adelante#

fillna() vs ffill() o bfill()#

Forwardfill#

Imputa el dato de la fila anterior en nuestro missing value

(
    nhanes_df
    .select_columns("height", "weight")
    
    # .fillna(method = "ffill")
    .ffill()
)
height weight
SEQN
93705.0 63.0 165.0
93706.0 68.0 145.0
93707.0 68.0 145.0
93709.0 62.0 200.0
93711.0 69.0 142.0
... ... ...
102949.0 72.0 180.0
102953.0 65.0 218.0
102954.0 66.0 150.0
102955.0 66.0 150.0
102956.0 69.0 250.0

7157 rows × 2 columns

Backfill#

Imputa el dato de la fila posterior en nuestro missing value

(
    nhanes_df
    .select_columns("height", "weight")
    
    # .fillna(method = "bfill")
    .bfill()
)
height weight
SEQN
93705.0 63.0 165.0
93706.0 68.0 145.0
93707.0 62.0 200.0
93709.0 62.0 200.0
93711.0 69.0 142.0
... ... ...
102949.0 72.0 180.0
102953.0 65.0 218.0
102954.0 66.0 150.0
102955.0 69.0 250.0
102956.0 69.0 250.0

7157 rows × 2 columns

Recomendaciones al imputar valores utilizando ffill() o bfill()#

Imputación dentro de dominios e imputación a través de variables correlacionadas

Se recomienda organizar por grupos similares para que no se imputen datos aleatoriamente, sino que el dato que se va a imputar tenga sentido

(
    nhanes_df
    .select_columns("height", "weight", "gender", "diabetes", "general_health_condition")
    .sort_values(
        by = ["gender", "diabetes", "general_health_condition", "height"],
        ascending = True
    )
    .transform_column(
        "weight",
        lambda x: x.ffill(),
        elementwise = False
    )
)
height weight gender diabetes general_health_condition
SEQN
94421.0 56.0 115.0 Female 0 Excellent
94187.0 59.0 130.0 Female 0 Excellent
95289.0 59.0 162.0 Female 0 Excellent
97967.0 59.0 130.0 Female 0 Excellent
99125.0 59.0 105.0 Female 0 Excellent
... ... ... ... ... ...
96561.0 74.0 290.0 Male 1 NaN
96954.0 NaN 175.0 Male 1 NaN
97267.0 NaN 175.0 Male 1 NaN
97856.0 NaN 175.0 Male 1 NaN
98317.0 NaN 175.0 Male 1 NaN

7157 rows × 5 columns

(
    nhanes_df
    .select_columns("height", "weight", "gender", "diabetes", "general_health_condition")
    .sort_values(
        by=['gender','diabetes','general_health_condition','weight'],
        ascending=True
    )
    .groupby(["gender", "general_health_condition"], dropna=False)
    .apply(lambda x: x.ffill())
)
/tmp/ipykernel_1384/909901090.py:2: FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.
To preserve the previous behavior, use

	>>> .groupby(..., group_keys=False)

To adopt the future behavior and silence this warning, use 

	>>> .groupby(..., group_keys=True)
  nhanes_df
height weight gender diabetes general_health_condition
SEQN
94195.0 63.0 90.0 Female 0 Excellent
95793.0 61.0 96.0 Female 0 Excellent
101420.0 59.0 98.0 Female 0 Excellent
94148.0 65.0 100.0 Female 0 Excellent
102062.0 62.0 100.0 Female 0 Excellent
... ... ... ... ... ...
96561.0 74.0 290.0 Male 1 NaN
96869.0 72.0 298.0 Male 1 NaN
97267.0 72.0 298.0 Male 1 NaN
97856.0 72.0 298.0 Male 1 NaN
98317.0 72.0 298.0 Male 1 NaN

7157 rows × 5 columns

Imputación por interpolación#

(
    airquality_df
    .select_columns("ozone")
    .pipe(
        lambda df: (
            df.ozone.plot(color = "#313638", marker= "o")
        )
    )
)
<AxesSubplot: xlabel='date'>
../_images/06cdefbfc49f9859a0c72c6482dce31d586443ec141dea781d7922b3b618dee0.png

Visualizando con ffill#

plt.figure(figsize=(20,10))

(
    airquality_df
    .select_columns("ozone")
    .pipe(
        lambda df: (
            df.ozone.ffill().plot(color= "red", marker="o", alpha=6/9, linestyle="dashed"),
            df.ozone.plot(color = "#313638", marker= "o")
        )
    )
)
(<AxesSubplot: xlabel='date'>, <AxesSubplot: xlabel='date'>)
../_images/4d4cded0354aa199493e9fc28759e9b2d3ad15e5e28868c8cd0ac1b43006121d.png

Visualizando con bfill#

plt.figure(figsize=(20,10))

(
    airquality_df
    .select_columns("ozone")
    .pipe(
        lambda df: (
            df.ozone.bfill().plot(color= "red", marker="o", alpha=6/9, linestyle="dashed"),
            df.ozone.plot(color = "#313638", marker= "o")
        )
    )
)
(<AxesSubplot: xlabel='date'>, <AxesSubplot: xlabel='date'>)
../_images/6965c7ecd8a29ec70341a96d5c477813bac89bfa168dd0a037554409092963cd.png

Visualizando con bfill o ffill con valores mas cercanos#

plt.figure(figsize=(20,10))

(
    airquality_df
    .select_columns("ozone")
    .pipe(
        lambda df: (
            df.ozone.interpolate(method = "nearest").plot(color= "red", marker="o", alpha=6/9, linestyle="dashed"),
            df.ozone.plot(color = "#313638", marker= "o")
        )
    )
)
(<AxesSubplot: xlabel='date'>, <AxesSubplot: xlabel='date'>)
../_images/a5dcc5a14176f1eb6376592fe6d9c159031f58eba4848a142a2c0b11c1e14239.png

Visualizando valores interpolados#

plt.figure(figsize=(20,10))

(
    airquality_df
    .select_columns("ozone")
    .pipe(
        lambda df: (
            df.ozone.interpolate(method = "linear").plot(color= "red", marker="o", alpha=6/9, linestyle="dashed"),
            df.ozone.plot(color = "#313638", marker= "o")
        )
    )
)
(<AxesSubplot: xlabel='date'>, <AxesSubplot: xlabel='date'>)
../_images/71e292d0597e5eda5b14546862186b139b989f2565bd88f99323680d2163d353.png

Ahora para asignar esos valores a la columna:

airquality_df["ozone"] = airquality_df.ozone.interpolate(method="linear")
airquality_df
ozone solar_r wind temp month day year
date
1973-05-01 41.0 190.0 7.4 67 5 1 1973
1973-05-02 36.0 118.0 8.0 72 5 2 1973
1973-05-03 12.0 149.0 12.6 74 5 3 1973
1973-05-04 18.0 313.0 11.5 62 5 4 1973
1973-05-05 23.0 NaN 14.3 56 5 5 1973
... ... ... ... ... ... ... ...
1973-09-26 30.0 193.0 6.9 70 9 26 1973
1973-09-27 22.0 145.0 13.2 77 9 27 1973
1973-09-28 14.0 191.0 14.3 75 9 28 1973
1973-09-29 18.0 131.0 8.0 76 9 29 1973
1973-09-30 20.0 223.0 11.5 68 9 30 1973

153 rows × 7 columns

Imputación por algoritmo de vecinos más cercanos (KNN)#

K nearest neighbors (KNN) es el algritmo mas utilizado de imputacion

Resumen de métodos de distancia:

  • Euclidiana: Útil para variables numéricas

  • Manhattan: Útil paa variables tipo factor

  • Hamming: Útil para variables categóricas

  • Gower: Útil para conjuntos de datos con variables mixtas

nhanes_transformed_df
general_health_condition gender height weight total_cholesterol pulse diabetes age
SEQN
93705.0 2.0 0.0 63.0 165.0 157.0 52.0 0.0 66.0
93706.0 4.0 1.0 68.0 145.0 148.0 82.0 0.0 18.0
93707.0 2.0 1.0 NaN NaN 189.0 100.0 0.0 13.0
93709.0 NaN 0.0 62.0 200.0 176.0 74.0 0.0 75.0
93711.0 4.0 1.0 69.0 142.0 238.0 62.0 0.0 56.0
... ... ... ... ... ... ... ... ...
102949.0 0.0 1.0 72.0 180.0 201.0 96.0 0.0 33.0
102953.0 1.0 1.0 65.0 218.0 182.0 78.0 0.0 42.0
102954.0 2.0 0.0 66.0 150.0 172.0 78.0 0.0 41.0
102955.0 4.0 0.0 NaN NaN 150.0 74.0 0.0 14.0
102956.0 2.0 1.0 69.0 250.0 163.0 76.0 0.0 38.0

7157 rows × 8 columns

KNN algorithm#

knn_imputer = sklearn.impute.KNNImputer()

#copia del df original
nhanes_df_knn = nhanes_transformed_df.copy(deep=True)

nhanes_df_knn.iloc[:, :] = knn_imputer.fit_transform(nhanes_transformed_df).round()
nhanes_df_knn
general_health_condition gender height weight total_cholesterol pulse diabetes age
SEQN
93705.0 2.0 0.0 63.0 165.0 157.0 52.0 0.0 66.0
93706.0 4.0 1.0 68.0 145.0 148.0 82.0 0.0 18.0
93707.0 2.0 1.0 69.0 130.0 189.0 100.0 0.0 13.0
93709.0 2.0 0.0 62.0 200.0 176.0 74.0 0.0 75.0
93711.0 4.0 1.0 69.0 142.0 238.0 62.0 0.0 56.0
... ... ... ... ... ... ... ... ...
102949.0 0.0 1.0 72.0 180.0 201.0 96.0 0.0 33.0
102953.0 1.0 1.0 65.0 218.0 182.0 78.0 0.0 42.0
102954.0 2.0 0.0 66.0 150.0 172.0 78.0 0.0 41.0
102955.0 4.0 0.0 71.0 159.0 150.0 74.0 0.0 14.0
102956.0 2.0 1.0 69.0 250.0 163.0 76.0 0.0 38.0

7157 rows × 8 columns

(
    pd.concat(
        [
            nhanes_df_knn,
            nhanes_df.missing.create_shadow_matrix(True, False, suffix="_imp", only_missing=True)
        ],
        axis=1
    )
)
general_health_condition gender height weight total_cholesterol pulse diabetes age height_imp weight_imp general_health_condition_imp total_cholesterol_imp pulse_imp
SEQN
93705.0 2.0 0.0 63.0 165.0 157.0 52.0 0.0 66.0 False False False False False
93706.0 4.0 1.0 68.0 145.0 148.0 82.0 0.0 18.0 False False False False False
93707.0 2.0 1.0 69.0 130.0 189.0 100.0 0.0 13.0 True True False False False
93709.0 2.0 0.0 62.0 200.0 176.0 74.0 0.0 75.0 False False True False False
93711.0 4.0 1.0 69.0 142.0 238.0 62.0 0.0 56.0 False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ...
102949.0 0.0 1.0 72.0 180.0 201.0 96.0 0.0 33.0 False False False False False
102953.0 1.0 1.0 65.0 218.0 182.0 78.0 0.0 42.0 False False False False False
102954.0 2.0 0.0 66.0 150.0 172.0 78.0 0.0 41.0 False False False False False
102955.0 4.0 0.0 71.0 159.0 150.0 74.0 0.0 14.0 True True False False False
102956.0 2.0 1.0 69.0 250.0 163.0 76.0 0.0 38.0 False False False False False

7157 rows × 13 columns

(
    pd.concat(
        [
            nhanes_df_knn,
            nhanes_df.missing.create_shadow_matrix(True, False, suffix="_imp", only_missing=True)
        ],
        axis=1
    )
    .missing.scatter_imputation_plot(
        x = "height",
        y = "weight"
    )
)
<AxesSubplot: xlabel='height', ylabel='weight'>
../_images/f26bce62af0a569d9ebe6e775aeab1c2e42c2a7c59a96efd4597510e6f5b43e3.png

Ordenamiento por cantidad de variables faltantes#

knn_imputer = sklearn.impute.KNNImputer()

#copia del df original
nhanes_df_knn = nhanes_transformed_df.missing.sort_variables_by_missingness(ascending=True).copy(deep=True)

nhanes_df_knn.iloc[:, :] = knn_imputer.fit_transform(nhanes_transformed_df.missing.sort_variables_by_missingness(ascending=True)).round()
nhanes_df_knn
gender diabetes age pulse total_cholesterol general_health_condition weight height
SEQN
93705.0 0.0 0.0 66.0 52.0 157.0 2.0 165.0 63.0
93706.0 1.0 0.0 18.0 82.0 148.0 4.0 145.0 68.0
93707.0 1.0 0.0 13.0 100.0 189.0 2.0 130.0 69.0
93709.0 0.0 0.0 75.0 74.0 176.0 2.0 200.0 62.0
93711.0 1.0 0.0 56.0 62.0 238.0 4.0 142.0 69.0
... ... ... ... ... ... ... ... ...
102949.0 1.0 0.0 33.0 96.0 201.0 0.0 180.0 72.0
102953.0 1.0 0.0 42.0 78.0 182.0 1.0 218.0 65.0
102954.0 0.0 0.0 41.0 78.0 172.0 2.0 150.0 66.0
102955.0 0.0 0.0 14.0 74.0 150.0 4.0 159.0 71.0
102956.0 1.0 0.0 38.0 76.0 163.0 2.0 250.0 69.0

7157 rows × 8 columns

(
    pd.concat(
        [
            nhanes_df_knn,
            nhanes_df.missing.create_shadow_matrix(True, False, suffix="_imp", only_missing=True)
        ],
        axis=1
    )
    .missing.scatter_imputation_plot(
        x = "height",
        y = "weight"
    )
)
<AxesSubplot: xlabel='height', ylabel='weight'>
../_images/f26bce62af0a569d9ebe6e775aeab1c2e42c2a7c59a96efd4597510e6f5b43e3.png

Imputación basada en modelos#

nhanes_model_df = (
    nhanes_df
    .select_columns("height", "weight", "gender", "age")
    .sort_values("weight")
    .transform_column(
        "weight",
        lambda x: x.ffill(),
        elementwise = False
    )
    .missing.bind_shadow_matrix(
        True,
        False,
        suffix="_imp",
        only_missing = False
    )
)

nhanes_model_df
height weight gender age height_imp weight_imp gender_imp age_imp
SEQN
99970.0 62.0 72.0 Male 16.0 False False False False
97877.0 50.0 80.0 Female 29.0 False False False False
96451.0 59.0 82.0 Female 80.0 False False False False
97730.0 62.0 82.0 Female 19.0 False False False False
99828.0 62.0 85.0 Female 16.0 False False False False
... ... ... ... ... ... ... ... ...
102915.0 NaN 484.0 Female 14.0 True False False False
102926.0 NaN 484.0 Female 15.0 True False False False
102941.0 NaN 484.0 Female 14.0 True False False False
102945.0 NaN 484.0 Male 15.0 True False False False
102955.0 NaN 484.0 Female 14.0 True False False False

7157 rows × 8 columns

height_ols = (
    nhanes_model_df
    .pipe(
        lambda df: smf.ols("height ~ weight + age", data=df)
    )
    .fit()
)
ols_imputed_values = (
    nhanes_model_df
    .pipe(
        lambda df: df[df.height.isna()]
    )
    .pipe(
        lambda df: height_ols.predict(df).round()
    )
)

ols_imputed_values
SEQN
100246.0    64.0
101209.0    64.0
102766.0    64.0
102509.0    64.0
96900.0     65.0
            ... 
102915.0    75.0
102926.0    75.0
102941.0    75.0
102945.0    75.0
102955.0    75.0
Length: 1669, dtype: float64
nhanes_model_df.loc[nhanes_model_df.height.isna(), ["height"]] = ols_imputed_values
nhanes_model_df
height weight gender age height_imp weight_imp gender_imp age_imp
SEQN
99970.0 62.0 72.0 Male 16.0 False False False False
97877.0 50.0 80.0 Female 29.0 False False False False
96451.0 59.0 82.0 Female 80.0 False False False False
97730.0 62.0 82.0 Female 19.0 False False False False
99828.0 62.0 85.0 Female 16.0 False False False False
... ... ... ... ... ... ... ... ...
102915.0 75.0 484.0 Female 14.0 True False False False
102926.0 75.0 484.0 Female 15.0 True False False False
102941.0 75.0 484.0 Female 14.0 True False False False
102945.0 75.0 484.0 Male 15.0 True False False False
102955.0 75.0 484.0 Female 14.0 True False False False

7157 rows × 8 columns

(
    nhanes_model_df
    .missing.scatter_imputation_plot(
        x = "weight",
        y = "height"
    )
)
<AxesSubplot: xlabel='weight', ylabel='height'>
../_images/f06d6c81ceea57bcac4436d8bd41709356f34c66cda42f275ca2d6b812e7f243.png

Imputaciones Múltiples por Ecuaciones Encadenadas (MICE)#

mice_imputer = sklearn.impute.IterativeImputer(
    estimator=BayesianRidge(),
    initial_strategy="mean",
    imputation_order="ascending"
)

nhanes_mice_df = nhanes_transformed_df.copy(deep=True)

nhanes_mice_df.iloc[:, :] = mice_imputer.fit_transform(nhanes_transformed_df).round()

nhanes_mice_df = pd.concat(
    [
        nhanes_mice_df,
        nhanes_df.missing.create_shadow_matrix(True, False, suffix="_imp")
    ],    
    axis=1
)

nhanes_mice_df
general_health_condition gender height weight total_cholesterol pulse diabetes age height_imp weight_imp general_health_condition_imp total_cholesterol_imp pulse_imp diabetes_imp age_imp gender_imp
SEQN
93705.0 2.0 0.0 63.0 165.0 157.0 52.0 0.0 66.0 False False False False False False False False
93706.0 4.0 1.0 68.0 145.0 148.0 82.0 0.0 18.0 False False False False False False False False
93707.0 2.0 1.0 70.0 200.0 189.0 100.0 0.0 13.0 True True False False False False False False
93709.0 2.0 0.0 62.0 200.0 176.0 74.0 0.0 75.0 False False True False False False False False
93711.0 4.0 1.0 69.0 142.0 238.0 62.0 0.0 56.0 False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
102949.0 0.0 1.0 72.0 180.0 201.0 96.0 0.0 33.0 False False False False False False False False
102953.0 1.0 1.0 65.0 218.0 182.0 78.0 0.0 42.0 False False False False False False False False
102954.0 2.0 0.0 66.0 150.0 172.0 78.0 0.0 41.0 False False False False False False False False
102955.0 4.0 0.0 64.0 159.0 150.0 74.0 0.0 14.0 True True False False False False False False
102956.0 2.0 1.0 69.0 250.0 163.0 76.0 0.0 38.0 False False False False False False False False

7157 rows × 16 columns

nhanes_mice_df.missing.scatter_imputation_plot(
    x = "height",
    y = "weight"
)
<AxesSubplot: xlabel='height', ylabel='weight'>
../_images/00643aca72aa70d7e3687e281104e52561cfbeee1839b9343164d27b98458933.png

Transformación inversa de los datos#

Deshacer la transformacion en dummy

nhanes_imputed_df = nhanes_mice_df.copy(deep=True)
nhanes_imputed_df[categorical_columns] = (
    categorical_transformer
    .named_transformers_
    .ordinalencoder
    .inverse_transform(
        X = nhanes_mice_df[categorical_columns]
    )
)

nhanes_imputed_df
general_health_condition gender height weight total_cholesterol pulse diabetes age height_imp weight_imp general_health_condition_imp total_cholesterol_imp pulse_imp diabetes_imp age_imp gender_imp
SEQN
93705.0 Good Female 63.0 165.0 157.0 52.0 0.0 66.0 False False False False False False False False
93706.0 Very good Male 68.0 145.0 148.0 82.0 0.0 18.0 False False False False False False False False
93707.0 Good Male 70.0 200.0 189.0 100.0 0.0 13.0 True True False False False False False False
93709.0 Good Female 62.0 200.0 176.0 74.0 0.0 75.0 False False True False False False False False
93711.0 Very good Male 69.0 142.0 238.0 62.0 0.0 56.0 False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
102949.0 Excellent Male 72.0 180.0 201.0 96.0 0.0 33.0 False False False False False False False False
102953.0 Fair or Male 65.0 218.0 182.0 78.0 0.0 42.0 False False False False False False False False
102954.0 Good Female 66.0 150.0 172.0 78.0 0.0 41.0 False False False False False False False False
102955.0 Very good Female 64.0 159.0 150.0 74.0 0.0 14.0 True True False False False False False False
102956.0 Good Male 69.0 250.0 163.0 76.0 0.0 38.0 False False False False False False False False

7157 rows × 16 columns

Comparativa de valores antes de imputacion con valores despues de imputacion

nhanes_df.general_health_condition.value_counts()
Good         2383
Very good    1503
Fair or      1130
Excellent     612
Poor?         169
Name: general_health_condition, dtype: int64
nhanes_imputed_df.general_health_condition.value_counts()
Good         3743
Very good    1503
Fair or      1130
Excellent     612
Poor?         169
Name: general_health_condition, dtype: int64

Me quedan entonces valores faltantes?

nhanes_imputed_df.missing.number_missing()
0
Created in deepnote.com Created in Deepnote