Manejo de Datos Faltantes: Detección

Manejo de Datos Faltantes: Detección#

Preparacion de entorno#

pip install -r requirements.txt

^C
Requirement already satisfied: pip in c:\users\admin\miniconda3\lib\site-packages (from -r requirements.txt (line 1)) (23.1.2)
Collecting pyjanitor (from -r requirements.txt (line 2))
  Using cached pyjanitor-0.25.0-py3-none-any.whl (171 kB)
Collecting missingno (from -r requirements.txt (line 3))
  Using cached missingno-0.5.2-py3-none-any.whl (8.7 kB)
Requirement already satisfied: numpy in c:\users\admin\miniconda3\lib\site-packages (from -r requirements.txt (line 4)) (1.25.2)
Collecting matplotlib==3.5.1 (from -r requirements.txt (line 5))
  Using cached matplotlib-3.5.1.tar.gz (35.3 MB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: pandas in c:\users\admin\miniconda3\lib\site-packages (from -r requirements.txt (line 6)) (2.1.0)
Collecting pyreadr (from -r requirements.txt (line 7))
  Using cached pyreadr-0.4.9-cp311-cp311-win_amd64.whl (1.3 MB)
Requirement already satisfied: seaborn in c:\users\admin\miniconda3\lib\site-packages (from -r requirements.txt (line 8)) (0.12.2)
Collecting session-info (from -r requirements.txt (line 9))
  Using cached session_info-1.0.0.tar.gz (24 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting upsetplot==0.6.1 (from -r requirements.txt (line 10))
  Using cached UpSetPlot-0.6.1.tar.gz (18 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: cycler>=0.10 in c:\users\admin\miniconda3\lib\site-packages (from matplotlib==3.5.1->-r requirements.txt (line 5)) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\admin\miniconda3\lib\site-packages (from matplotlib==3.5.1->-r requirements.txt (line 5)) (4.42.1)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\admin\miniconda3\lib\site-packages (from matplotlib==3.5.1->-r requirements.txt (line 5)) (1.4.5)
Requirement already satisfied: packaging>=20.0 in c:\users\admin\miniconda3\lib\site-packages (from matplotlib==3.5.1->-r requirements.txt (line 5)) (23.0)
Requirement already satisfied: pillow>=6.2.0 in c:\users\admin\miniconda3\lib\site-packages (from matplotlib==3.5.1->-r requirements.txt (line 5)) (10.0.0)
Requirement already satisfied: pyparsing>=2.2.1 in c:\users\admin\miniconda3\lib\site-packages (from matplotlib==3.5.1->-r requirements.txt (line 5)) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\admin\miniconda3\lib\site-packages (from matplotlib==3.5.1->-r requirements.txt (line 5)) (2.8.2)
Collecting natsort (from pyjanitor->-r requirements.txt (line 2))
  Downloading natsort-8.4.0-py3-none-any.whl (38 kB)
Requirement already satisfied: pandas-flavor in c:\users\admin\miniconda3\lib\site-packages (from pyjanitor->-r requirements.txt (line 2)) (0.6.0)
Collecting multipledispatch (from pyjanitor->-r requirements.txt (line 2))
  Downloading multipledispatch-1.0.0-py3-none-any.whl (12 kB)
Requirement already satisfied: scipy in c:\users\admin\miniconda3\lib\site-packages (from pyjanitor->-r requirements.txt (line 2)) (1.11.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\admin\miniconda3\lib\site-packages (from pandas->-r requirements.txt (line 6)) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in c:\users\admin\miniconda3\lib\site-packages (from pandas->-r requirements.txt (line 6)) (2023.3)
Collecting stdlib_list (from session-info->-r requirements.txt (line 9))
  Downloading stdlib_list-0.9.0-py3-none-any.whl (75 kB)
                                              0.0/75.6 kB ? eta -:--:--
     ---------------------------------------- 75.6/75.6 kB 2.1 MB/s eta 0:00:00
Requirement already satisfied: six>=1.5 in c:\users\admin\miniconda3\lib\site-packages (from python-dateutil>=2.7->matplotlib==3.5.1->-r requirements.txt (line 5)) (1.16.0)
Requirement already satisfied: xarray in c:\users\admin\miniconda3\lib\site-packages (from pandas-flavor->pyjanitor->-r requirements.txt (line 2)) (2023.8.0)
Building wheels for collected packages: matplotlib, upsetplot, session-info
  Building wheel for matplotlib (setup.py): started

Note: you may need to restart the kernel to use updated packages.

Importar librerías

import janitor
import matplotlib.pyplot as plt
import missingno
import numpy as np
import pandas as pd
import pyreadr
import seaborn as sns
import session_info
import upsetplot

Importar funciones personalizadas#

%run pandas-missing-extension.ipynb

Configurar el aspecto general de las gráficas del proyecto

%matplotlib inline

sns.set(
    rc={
        "figure.figsize": (10, 10)
    }
)

sns.set_style("whitegrid")

Cargar los conjuntos de datos#

Pima Indians Diabetes#

# Saving the link into a variable
pima_indians_diabetes_url = "https://nrvis.com/data/mldata/pima-indians-diabetes.csv"

# Downloading the csv file
!wget -O ./data/pima-indians-diabetes.csv { pima_indians_diabetes_url } -q

diabetes_df = pd.read_csv(
    "./data/pima-indians-diabetes.csv",
    sep=",",
    names=[
        "pregnancies",
        "glucose",
        "blood_pressure",
        "skin_thickness",
        "insulin",
        "bmi",
        "diabetes_pedigree_function",
        "age",
        "outcome",
    ]
)

diabetes_df.head()

	pregnancies	glucose	blood_pressure	skin_thickness	insulin	bmi	diabetes_pedigree_function	age	outcome
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0
2	8	183	64	0	0	23.3	0.672	32	1
3	1	89	66	23	94	28.1	0.167	21	0
4	0	137	40	35	168	43.1	2.288	33	1

naniar (oceanbuoys, pedestrian, riskfactors)#

Crear unidades de información de los conjuntos de datos#

base_url = "https://github.com/njtierney/naniar/raw/master/data/"
dataset_names = ("oceanbuoys", "pedestrian", "riskfactors")
extension = ".rda"

Descargar y cargar los conjuntos de datos#

datasets_dfs = {}

for dataset_name in dataset_names:

    dataset_file = f"{ dataset_name }{ extension }"
    dataset_output_file = f"./data/{ dataset_file }"
    dataset_url = f"{ base_url }{ dataset_file }"

    !wget -q -O { dataset_output_file } { dataset_url }

    datasets_dfs[f"{ dataset_name }_df"] = pyreadr.read_r(dataset_output_file).get(dataset_name)

datasets_dfs.keys()

dict_keys(['oceanbuoys_df', 'pedestrian_df', 'riskfactors_df'])

Incluir conjuntos de datos en nuestro ambiente local#

locals().update(**datasets_dfs)
del datasets_dfs

Verificar carga#

oceanbuoys_df.shape, pedestrian_df.shape, riskfactors_df.shape, diabetes_df.shape

((736, 8), (37700, 9), (245, 34), (768, 9))

Note que la variable state tiene 245 valores no nulos, mientras que la variable pregnant tiene 30

la variable pregnant parece tener muchos valores no nulos, vamos a investigar?

riskfactors_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 245 entries, 0 to 244
Data columns (total 34 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 state             245 non-null    category
 sex               245 non-null    category
 age               245 non-null    int32   
 weight_lbs        235 non-null    object  
 height_inch       243 non-null    object  
 bmi               234 non-null    float64 
 marital           244 non-null    category
 pregnant          30 non-null     category
 children          245 non-null    int32   
 education         244 non-null    category
employment        245 non-null    category
income            245 non-null    category
veteran           242 non-null    category
hispanic          243 non-null    category
health_general    245 non-null    category
health_physical   245 non-null    int32   
health_mental     245 non-null    int32   
health_poor       132 non-null    object  
health_cover      245 non-null    category
provide_care      242 non-null    category
activity_limited  242 non-null    category
drink_any         243 non-null    category
drink_days        111 non-null    object  
drink_average     110 non-null    object  
smoke_100         243 non-null    category
smoke_days        117 non-null    category
smoke_stop        33 non-null     category
smoke_last        84 non-null     category
diet_fruit        237 non-null    object  
diet_salad        237 non-null    object  
diet_potato       237 non-null    object  
diet_carrot       237 non-null    object  
diet_vegetable    237 non-null    object  
diet_juice        237 non-null    object  
dtypes: category(18), float64(1), int32(4), object(11)
memory usage: 36.7+ KB

Tabulación de valores faltantes#

Número total de valores completos (sin observaciones faltantes)#

riskfactors_df.missing.number_complete()

Número total de valores faltantes#

riskfactors_df.missing.number_missing()

Resúmenes tabulares de valores faltantes#

Variables / Columnas#

Resumen por variable#

riskfactors_df.missing.missing_variable_summary()

	variable	n_missing	n_cases	pct_missing
0	state	0	245	0.000000
1	sex	0	245	0.000000
2	age	0	245	0.000000
3	weight_lbs	10	245	4.081633
4	height_inch	2	245	0.816327
5	bmi	11	245	4.489796
6	marital	1	245	0.408163
7	pregnant	215	245	87.755102
8	children	0	245	0.000000
9	education	1	245	0.408163
10	employment	0	245	0.000000
11	income	0	245	0.000000
12	veteran	3	245	1.224490
13	hispanic	2	245	0.816327
14	health_general	0	245	0.000000
15	health_physical	0	245	0.000000
16	health_mental	0	245	0.000000
17	health_poor	113	245	46.122449
18	health_cover	0	245	0.000000
19	provide_care	3	245	1.224490
20	activity_limited	3	245	1.224490
21	drink_any	2	245	0.816327
22	drink_days	134	245	54.693878
23	drink_average	135	245	55.102041
24	smoke_100	2	245	0.816327
25	smoke_days	128	245	52.244898
26	smoke_stop	212	245	86.530612
27	smoke_last	161	245	65.714286
28	diet_fruit	8	245	3.265306
29	diet_salad	8	245	3.265306
30	diet_potato	8	245	3.265306
31	diet_carrot	8	245	3.265306
32	diet_vegetable	8	245	3.265306
33	diet_juice	8	245	3.265306

Tabulación del resumen por variable#

riskfactors_df.missing.missing_variable_table()

	n_missing_in_variable	n_variables	pct_variables
0	0	10	29.411765
1	8	6	17.647059
2	2	4	11.764706
3	3	3	8.823529
4	1	2	5.882353
5	10	1	2.941176
6	11	1	2.941176
7	113	1	2.941176
8	128	1	2.941176
9	134	1	2.941176
10	135	1	2.941176
11	161	1	2.941176
12	212	1	2.941176
13	215	1	2.941176

Casos / Observaciones / Filas#

Resúmenes por caso (muestra valores nulos por row)#

riskfactors_df.missing.missing_case_summary()

	case	n_missing	pct_missing
0	0	6	16.666667
1	1	6	16.666667
2	2	7	19.444444
3	3	12	33.333333
4	4	5	13.888889
...	...	...	...
240	240	6	16.666667
241	241	5	13.888889
242	242	3	8.333333
243	243	2	5.555556
244	244	3	8.333333

245 rows × 3 columns

Tabulación del resumen por caso#

riskfactors_df.missing.missing_case_table()

	n_missing_in_case	n_cases	pct_case
0	4	49	20.000000
1	5	45	18.367347
2	7	39	15.918367
3	6	36	14.693878
4	2	31	12.653061
5	3	30	12.244898
6	1	4	1.632653
7	8	3	1.224490
8	12	3	1.224490
9	15	2	0.816327
10	9	1	0.408163
11	10	1	0.408163
12	11	1	0.408163

Intervalos de valores faltantes#

(
    riskfactors_df
    .missing
    .missing_variable_span(
        variable="weight_lbs",            # variable a analizar
        span_every=50                     # intervalos a romper la variable
    )
)

	span_counter	n_missing	n_complete	pct_missing	pct_complete
0	0	1	49	2.000000	98.000000
1	1	5	45	10.000000	90.000000
2	2	1	49	2.000000	98.000000
3	3	1	49	2.000000	98.000000
4	4	2	43	4.444444	95.555556

Run length de valores faltantes#

Detecta automáticamente el conteo seguido de variables nulas

(
    riskfactors_df
    .missing
    .missing_variable_run(
        variable="weight_lbs"
    )
)

	run_length	is_na
0	14	complete
1	1	missing
2	45	complete
3	1	missing
4	5	complete
5	1	missing
6	12	complete
7	1	missing
8	10	complete
9	2	missing
10	40	complete
11	1	missing
12	53	complete
13	1	missing
14	14	complete
15	1	missing
16	31	complete
17	1	missing
18	11	complete

Visualización inicial de valores faltantes#

Variable#

riskfactors_df.missing.missing_variable_plot()

../_images/3ca16b741ffe0ed4edc2efd83cd04f51bc55748b7790d656cafeccb76f95607c.png

Casos / Observaciones / Filas#

riskfactors_df.missing.missing_case_plot()

../_images/4fdcf5b1027788b2ab6cf881c7470a1db50b6c77ccb735594776e5c1f1b6fe5b.png

(
    riskfactors_df
    .missing
    .missing_variable_span_plot(
        variable="weight_lbs",
        span_every=10,
        rot = 0     # label rotation
    )
)

../_images/980480ca24cc0dc41d392dd17a04d1925d9a1e669f8f57a5b1d45330c6fb0c1c.png

Que tan completas están nuestras variables?

missingno.bar(df = riskfactors_df);

../_images/304f00ea5f37ac585ea8b5421411a93739ad9d33e32a7594571e32f8571748ab.png

missingno.matrix(df=riskfactors_df);

/root/venv/lib/python3.9/site-packages/missingno/missingno.py:73: MatplotlibDeprecationWarning: The 'b' parameter of grid() has been renamed 'visible' since Matplotlib 3.5; support for the old name will be dropped two minor releases later.
  ax0.grid(b=False)
/root/venv/lib/python3.9/site-packages/missingno/missingno.py:142: MatplotlibDeprecationWarning: The 'b' parameter of grid() has been renamed 'visible' since Matplotlib 3.5; support for the old name will be dropped two minor releases later.
  ax1.grid(b=False)

../_images/05dd59cc34d69b7783a9ab064ea918e75e5860fc612b8d33f5971b7974eb59c4.png

(
    riskfactors_df
    .missing
    .missing_upsetplot(
        variables=["pregnant","weight_lbs","smoke_stop"],  # Put the target variable name
        # alternatively, put none to see all variables
        
        element_size = 60    # This one specifies the size of the figure that you want to create
    )
);

../_images/99f5709fbaa17f12b6567142af42744f444dca5ca4b087c8a78ec5c62e60dc6b.png

Codificación de valores faltantes#

🚧 Advertencia

Al igual que cada persona es una nueva puerta a un mundo diferente, los valores faltantes existen en diferentes formas y colores. Al trabajar con valores faltantes será crítico entender sus distintas representaciones. A pesar de que el conjunto de datos de trabajo pareciera que no contiene valores faltantes, deberás ser capaz de ir más allá de lo observado a simple vista para remover el manto tras el cual se esconde lo desconocido.

Valores comúnmente asociados a valores faltantes#

Cadenas de texto#

common_na_strings = (
    "missing",
    "NA",
    "N A",
    "N/A",
    "#N/A",
    "NA ",
    " NA",
    "N /A",
    "N / A",
    " N / A",
    "N / A ",
    "na",
    "n a",
    "n/a",
    "na ",
    " na",
    "n /a",
    "n / a",
    " a / a",
    "n / a ",
    "NULL",
    "null",
    "",
    "?",
    "*",
    ".",
)

Números#

common_na_numbers = (-9, -99, -999, -9999, 9999, 66, 77, 88, -1)

¿Cómo encontrar los valores comúnmente asociados a valores faltantes?#

missing_data_example_df = pd.DataFrame.from_dict(
    dict(
        x = [1, 3, "NA", -99, -98, -99],
        y = ["A", "N/A", "NA", "E", "F", "G"],
        z = [-100, -99, -98, -101, -1, -1]
    )
)

missing_data_example_df

	x	y	z
0	1	A	-100
1	3	N/A	-99
2	NA	NA	-98
3	-99	E	-101
4	-98	F	-1
5	-99	G	-1

Ya sabemos que hay valores faltantes

Será que Pandas los detecta?

missing_data_example_df.isna()

	x	y	z
0	False	False	False
1	False	False	False
2	False	False	False
3	False	False	False
4	False	False	False
5	False	False	False

Sustituyendo valores comúnmente asociados a valores faltantes#

Sustitución desde la lectura de datos#

pd.read_csv(
    "./data/missing_data_enconding_example.csv",
    na_filter=True,
    na_values=[-99, -1]
)

	x	y	z
0	1.0	A	-100.0
1	3.0	NaN	NaN
2	NaN	NaN	-98.0
3	NaN	E	-101.0
4	-98.0	F	NaN
5	NaN	G	NaN

Sustitución global#

(
    missing_data_example_df
    .replace(
        to_replace=[-99, "NA"],
        value=np.nan
    )
)

	x	y	z
0	1.0	A	-100.0
1	3.0	N/A	NaN
2	NaN	NaN	-98.0
3	NaN	E	-101.0
4	-98.0	F	-1.0
5	NaN	G	-1.0

Sustitución dirigida#

Sustituir solamente en una columna

(
    missing_data_example_df
    .replace(
        to_replace={
            "x":{-99:np.nan}
        }
    )
)

	x	y	z
0	1	A	-100
1	3	N/A	-99
2	NA	NA	-98
3	NaN	E	-101
4	-98	F	-1
5	NaN	G	-1

Conversión de valores faltantes implícitos a explícitos#

🚧 Advertencia

"Implícito se refiere a todo aquello que se entiende que está incluido pero sin ser expresado de forma directa o explícitamente."

Un valor faltante implícito indica que el valor faltante debería estar incluido en el conjunto de datos del análisis, sin que éste lo diga o lo especifique. Por lo general, son valores que podemos encontrar al pivotar nuestros datos o contabilizar el número de apariciones de combinaciones de las variables de estudio.

implicit_to_explicit_df = pd.DataFrame.from_dict(
    data={
        "name": ["lynn", "lynn", "lynn", "zelda"],
        "time": ["morning", "afternoon", "night", "morning"],
        "value": [350, 310, np.nan, 320]
    }
)

implicit_to_explicit_df

	name	time	value
0	lynn	morning	350.0
1	lynn	afternoon	310.0
2	lynn	night	NaN
3	zelda	morning	320.0

Note que implicitamente hay valores faltantes, por que zelda no tiene valores para afternoon & night?

Estrategias para la identificación de valores faltantes implícitos#

Pivotar la tabla de datos#

(
    implicit_to_explicit_df
    .pivot_wider(
        index="name",
        names_from="time",
        values_from="value"
    )
)

	name	afternoon	morning	night
0	lynn	310.0	350.0	NaN
1	zelda	NaN	320.0	NaN

Cuantificar ocurrencias de n-tuplas#

(
    implicit_to_explicit_df
    .value_counts(
        subset=["name"]
    )
    .reset_index(name="n")
    .query("n<3")
)

	name	n
1	zelda	1

Exponer filas faltantes implícitas a explícitas#

📘 Información

janitor.complete() está modelada a partir de la función complete() del paquete tidyr y es un wrapper alrededor de janitor.expand_grid(), pd.merge() y pd.fillna(). En cierto modo, es lo contrario de pd.dropna(), ya que expone implícitamente las filas que faltan.

Son posibles combinaciones de nombres de columnas o una lista/tupla de nombres de columnas, o incluso un diccionario de nombres de columna y nuevos valores.

Las columnas MultiIndex no son complatibles.

Exponer n-tuplas de valores faltantes#

Ejemplo, encontrar los pares faltantes de name y time.

(
    implicit_to_explicit_df

    #janitor
    .complete(
        "name",
        "time"
    )
)

	name	time	value
0	lynn	morning	350.0
1	lynn	afternoon	310.0
2	lynn	night	NaN
3	zelda	morning	320.0
4	zelda	afternoon	NaN
5	zelda	night	NaN

Limitar la exposición de n-tuplas de valores faltantes#

(
    implicit_to_explicit_df

    #janitor
    .complete(
        {"name":["lynn", "zelda"]},
        {"time":["morning", "afternoon"]},
        sort=True
    )
)

	name	time	value
0	lynn	afternoon	310.0
1	lynn	morning	350.0
2	zelda	afternoon	NaN
3	zelda	morning	320.0
4	lynn	night	NaN

Rellenar los valores faltantes#

(
    implicit_to_explicit_df

    #janitor
    .complete(
        "name",
        "time",
        fill_value = np.nan    #aqui se puede poner cualquier valor para los nulos
    )
)

	name	time	value
0	lynn	morning	350.0
1	lynn	afternoon	310.0
2	lynn	night	NaN
3	zelda	morning	320.0
4	zelda	afternoon	NaN
5	zelda	night	NaN

Limitar el rellenado de valores faltantes implícitos#

(
    implicit_to_explicit_df

    #janitor
    .complete(
        "name",
        "time",
        fill_value = 0,
        explicit=False
    )
)

	name	time	value
0	lynn	morning	350.0
1	lynn	afternoon	310.0
2	lynn	night	NaN
3	zelda	morning	320.0
4	zelda	afternoon	0.0
5	zelda	night	0.0

Tipos de valores faltantes#

Usando el DF de diabetes, determine si hay valores faltantes

diabetes_df.missing.missing_variable_plot()

../_images/8a6d5244a2199ce1f136b8b63f53e68c18644a76480d1677d035d728cc045cff.png

This chart is basically saying that there are no missing values, which is false. The problem is that the missing values are in a different format

The missing values are shown as a zero, let’s change them into numpy nan

diabetes_df[diabetes_df.columns[1:6]] = diabetes_df[diabetes_df.columns[1:6]].replace(0, np.nan)
diabetes_df.missing.missing_variable_plot()

../_images/bbe383f54e893b7e4f076b47ad192fb20b03df47f54737448e3a4d8a35eeca34.png

Missing Completely At Random (MCAR)#

(
    diabetes_df
    .missing.sort_variables_by_missingness()
    .pipe(missingno.matrix)
)

/root/venv/lib/python3.9/site-packages/missingno/missingno.py:73: MatplotlibDeprecationWarning: The 'b' parameter of grid() has been renamed 'visible' since Matplotlib 3.5; support for the old name will be dropped two minor releases later.
  ax0.grid(b=False)
/root/venv/lib/python3.9/site-packages/missingno/missingno.py:142: MatplotlibDeprecationWarning: The 'b' parameter of grid() has been renamed 'visible' since Matplotlib 3.5; support for the old name will be dropped two minor releases later.
  ax1.grid(b=False)

<AxesSubplot:>

../_images/cafb12e19bf62a7836f1d4bf504634437ad29ce91ecddb8f9e1b874b8dd6fd15.png

Se ordenan las columnas tal que las que tienen mas valores faltantes aparecen primero.

Missing At Random (MAR)#

(
    diabetes_df
    .missing.sort_variables_by_missingness()
    .sort_values(by="blood_pressure")
    .pipe(missingno.matrix)
)

/root/venv/lib/python3.9/site-packages/missingno/missingno.py:73: MatplotlibDeprecationWarning: The 'b' parameter of grid() has been renamed 'visible' since Matplotlib 3.5; support for the old name will be dropped two minor releases later.
  ax0.grid(b=False)
/root/venv/lib/python3.9/site-packages/missingno/missingno.py:142: MatplotlibDeprecationWarning: The 'b' parameter of grid() has been renamed 'visible' since Matplotlib 3.5; support for the old name will be dropped two minor releases later.
  ax1.grid(b=False)

<AxesSubplot:>

../_images/839f2d15fb0c315f708bc3426710429f07383e3a7bad654bb6238f4b87c3f0cc.png

Aqui se puede empezar a inferir que puede haber una relacion entre la presion sanguinea y los datos faltantes

Missing Not At Random (MNAR)#

(
    diabetes_df
    .missing.sort_variables_by_missingness()
    .sort_values(by="insulin")
    .pipe(missingno.matrix)
)

/root/venv/lib/python3.9/site-packages/missingno/missingno.py:73: MatplotlibDeprecationWarning: The 'b' parameter of grid() has been renamed 'visible' since Matplotlib 3.5; support for the old name will be dropped two minor releases later.
  ax0.grid(b=False)
/root/venv/lib/python3.9/site-packages/missingno/missingno.py:142: MatplotlibDeprecationWarning: The 'b' parameter of grid() has been renamed 'visible' since Matplotlib 3.5; support for the old name will be dropped two minor releases later.
  ax1.grid(b=False)

<AxesSubplot:>

../_images/d6d1bfeb6cbb76566e61b30b5401b362fccf6a59ecd93f915351897e9a6ecba5.png

Aqui podemos observar que este error es sistematico. Siempre que se colectan los datos de insulina, tambien se colectan el resto de datos. A su vez, cuando no se colectan los datos de insulina, tienden a haber datos faltantes en las otras variables

Concepto y aplicación de la matriz de sombras (i.e., shadow matrix)#

Construcción de la matriz de sombras#

(
    riskfactors_df
    .isna()
    .replace({
        False:"Not missing",
        True:"Missing"
    })
    .add_suffix("_NA")
    .pipe(
        lambda shadow_matrix: pd.concat(
            [riskfactors_df, shadow_matrix],
            axis="columns"
        )
    )
)

	state	sex	age	weight_lbs	height_inch	bmi	marital	pregnant	children	education	...	smoke_100_NA	smoke_days_NA	smoke_stop_NA	smoke_last_NA	diet_fruit_NA	diet_salad_NA	diet_potato_NA	diet_carrot_NA	diet_vegetable_NA	diet_juice_NA
0	26	Female	49	190	64	32.68	Married	NaN	0	6	...	Not missing	Missing	Missing	Missing	Not missing	Not missing	Not missing	Not missing	Not missing	Not missing
1	40	Female	48	170	68	25.90	Divorced	NaN	0	5	...	Not missing	Missing	Missing	Missing	Not missing	Not missing	Not missing	Not missing	Not missing	Not missing
2	72	Female	55	163	64	28.04	Married	NaN	0	4	...	Not missing	Missing	Missing	Missing	Not missing	Not missing	Not missing	Not missing	Not missing	Not missing
3	42	Male	42	230	74	29.59	Married	NaN	1	6	...	Not missing	Missing	Missing	Missing	Missing	Missing	Missing	Missing	Missing	Missing
4	32	Female	66	135	62	24.74	Widowed	NaN	0	5	...	Not missing	Not missing	Not missing	Missing	Not missing	Not missing	Not missing	Not missing	Not missing	Not missing
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
240	10	Female	79	144	63	25.56	Widowed	NaN	0	4	...	Not missing	Missing	Missing	Missing	Not missing	Not missing	Not missing	Not missing	Not missing	Not missing
241	46	Male	45	170	74	21.87	Divorced	NaN	2	4	...	Not missing	Missing	Missing	Missing	Not missing	Not missing	Not missing	Not missing	Not missing	Not missing
242	15	Male	62	175	71	24.46	Divorced	NaN	0	6	...	Not missing	Not missing	Missing	Not missing	Not missing	Not missing	Not missing	Not missing	Not missing	Not missing
243	34	Female	62	138	64	23.74	Married	NaN	0	4	...	Not missing	Not missing	Not missing	Missing	Not missing	Not missing	Not missing	Not missing	Not missing	Not missing
244	18	Male	9	200	70	28.76	Married	NaN	0	4	...	Not missing	Not missing	Missing	Not missing	Not missing	Not missing	Not missing	Not missing	Not missing	Not missing

245 rows × 68 columns

Utilizar función de utilería `bind_shadow_matrix()`#

(
    riskfactors_df
    .missing
    .bind_shadow_matrix(only_missing=True)   #Only missing=true will just concat cols with missing values
)

	state	sex	age	weight_lbs	height_inch	bmi	marital	pregnant	children	education	...	smoke_100_NA	smoke_days_NA	smoke_stop_NA	smoke_last_NA	diet_fruit_NA	diet_salad_NA	diet_potato_NA	diet_carrot_NA	diet_vegetable_NA	diet_juice_NA
0	26	Female	49	190	64	32.68	Married	NaN	0	6	...	Not Missing	Missing	Missing	Missing	Not Missing	Not Missing	Not Missing	Not Missing	Not Missing	Not Missing
1	40	Female	48	170	68	25.90	Divorced	NaN	0	5	...	Not Missing	Missing	Missing	Missing	Not Missing	Not Missing	Not Missing	Not Missing	Not Missing	Not Missing
2	72	Female	55	163	64	28.04	Married	NaN	0	4	...	Not Missing	Missing	Missing	Missing	Not Missing	Not Missing	Not Missing	Not Missing	Not Missing	Not Missing
3	42	Male	42	230	74	29.59	Married	NaN	1	6	...	Not Missing	Missing	Missing	Missing	Missing	Missing	Missing	Missing	Missing	Missing
4	32	Female	66	135	62	24.74	Widowed	NaN	0	5	...	Not Missing	Not Missing	Not Missing	Missing	Not Missing	Not Missing	Not Missing	Not Missing	Not Missing	Not Missing
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
240	10	Female	79	144	63	25.56	Widowed	NaN	0	4	...	Not Missing	Missing	Missing	Missing	Not Missing	Not Missing	Not Missing	Not Missing	Not Missing	Not Missing
241	46	Male	45	170	74	21.87	Divorced	NaN	2	4	...	Not Missing	Missing	Missing	Missing	Not Missing	Not Missing	Not Missing	Not Missing	Not Missing	Not Missing
242	15	Male	62	175	71	24.46	Divorced	NaN	0	6	...	Not Missing	Not Missing	Missing	Not Missing	Not Missing	Not Missing	Not Missing	Not Missing	Not Missing	Not Missing
243	34	Female	62	138	64	23.74	Married	NaN	0	4	...	Not Missing	Not Missing	Not Missing	Missing	Not Missing	Not Missing	Not Missing	Not Missing	Not Missing	Not Missing
244	18	Male	9	200	70	28.76	Married	NaN	0	4	...	Not Missing	Not Missing	Missing	Not Missing	Not Missing	Not Missing	Not Missing	Not Missing	Not Missing	Not Missing

245 rows × 58 columns

Explorar estadísticos utilizando las nuevas columnas de la matriz de sombras#

(
    riskfactors_df
    .missing
    .bind_shadow_matrix(only_missing=True)
    .groupby(["weight_lbs_NA"])
    ["age"]
    .describe()
    .reset_index()
)

	weight_lbs_NA	count	mean	std	min	25%	50%	75%	max
0	Missing	10.0	60.100000	13.706851	37.0	52.25	62.5	65.0	82.0
1	Not Missing	235.0	58.021277	17.662904	7.0	47.50	59.0	70.0	97.0

Visualización de valores faltantes en una variable#

(
    riskfactors_df
    .missing
    .bind_shadow_matrix(only_missing=True)
    .pipe(
        lambda df:(
            sns.boxenplot(
                data=df,
                x="weight_lbs_NA",
                y="age"
            )
        )
    )
)

<AxesSubplot:xlabel='weight_lbs_NA', ylabel='age'>

../_images/95ca3995e220d17a478fa9517531ae7a6866746871831d2e8022de836e73e8a8.png

(
    riskfactors_df
    .missing
    .bind_shadow_matrix(only_missing=True)
    .pipe(
        lambda df:(
            sns.displot(
                data=df,
                x="age",
                hue="weight_lbs_NA",
                kind="kde"
            )
        )
    )
)

<seaborn.axisgrid.FacetGrid at 0x7f3f7c0a0670>

../_images/149966d6961938f90467e0921d6e158cb5a84dbc4443c889eb01c6a5c20f5815.png

(
    riskfactors_df
    .missing
    .bind_shadow_matrix(only_missing=True)
    .pipe(
        lambda df:(
            sns.displot(
                data=df,
                x="age",
                col="weight_lbs_NA",
                facet_kws={
                    "sharey":False
                }
            )
        )
    )
)

<seaborn.axisgrid.FacetGrid at 0x7f3f7c121c70>

../_images/c4be92f223bed4a95240ad8c5f73d127837ff5f92d43056f748cc75a4d89938b.png

(
    riskfactors_df
    .missing
    .bind_shadow_matrix(only_missing=True)
    .pipe(
        lambda df:(
            sns.displot(
                data=df,
                x="age",
                col="marital_NA",
                row="weight_lbs_NA"
            )
        )
    )
)

<seaborn.axisgrid.FacetGrid at 0x7f3f755a59a0>

../_images/1acad86be993cc264811f04543774667886d1d5e2f26eb5d288294f3e78f83eb.png

Visualización de valores faltantes en dos variables#

def column_fill_with_dummies(
    column: pd.Series,
    proportion_below: float=0.10,
    jitter: float=0.075,
    seed: int=42
) -> pd.Series:

    column = column.copy(deep=True)

    # Extract values metadata
    missing_mask = column.isna()
    number_missing_values = missing_mask.sum()
    column_range = column.max() - column.min()

    # Shift data
    column_shift = column.min() - column.min() * proportion_below

    # Create the "jitter" (noise) to be added around the points.
    np.random.seed(seed)
    column_jitter = (np.random.rand(number_missing_values) - 2) * column_range * jitter

    # Save new dummy data.
    column[missing_mask] = column_shift + column_jitter

    return column

(
    riskfactors_df
    .select_dtypes(
        exclude="category"
    )
    .pipe(
        lambda df: df[df.columns[df.isna().any()]]
    )
    .missing.bind_shadow_matrix(true_string = True, false_string = False)
    .apply(
        lambda column: column if "_NA" in column.name else column_fill_with_dummies(column, proportion_below=0.05, jitter = 0.075)
    )
    .assign(
        nullity = lambda df: df.weight_lbs_NA | df.height_inch_NA
    )
    .pipe(
        lambda df: (
            sns.scatterplot(
                data = df,
                x = "weight_lbs",
                y = "height_inch",
                hue = "nullity"
            )
        )
    )
)

<AxesSubplot:xlabel='weight_lbs', ylabel='height_inch'>

../_images/466f620c95b7991179792537b195afbec0224e2e45937cf6e0d39bdfe8a75946.png

Correlación de nulidad#

missingno.heatmap(
    df=riskfactors_df
)

/shared-libs/python3.9/py/lib/python3.9/site-packages/seaborn/matrix.py:306: MatplotlibDeprecationWarning: Auto-removal of grids by pcolor() and pcolormesh() is deprecated since 3.5 and will be removed two minor releases later; please call grid(False) first.
  mesh = ax.pcolormesh(self.plot_data, cmap=self.cmap, **kws)
/shared-libs/python3.9/py/lib/python3.9/site-packages/seaborn/matrix.py:316: MatplotlibDeprecationWarning: Auto-removal of grids by pcolor() and pcolormesh() is deprecated since 3.5 and will be removed two minor releases later; please call grid(False) first.
  cb = ax.figure.colorbar(mesh, cax, ax, **self.cbar_kws)

<AxesSubplot:>

../_images/cbbef9e5dd89325d77ffe01d3a4dba340716b146a2453372665f48cee19be59d.png

missingno.dendrogram(
    df=riskfactors_df
)

/root/venv/lib/python3.9/site-packages/missingno/missingno.py:475: MatplotlibDeprecationWarning: The 'b' parameter of grid() has been renamed 'visible' since Matplotlib 3.5; support for the old name will be dropped two minor releases later.
  ax0.grid(b=False)

<AxesSubplot:>

../_images/56d0f6b65749d8c633426e49eb038d94ddedbea0ab352ae2a09f7db5a3553a91.png

Eliminación de valores faltantes#

🚧 Advertencia

La eliminación de valores faltantes asume que los valores faltantes están perdidos completamente al azar (MCAR). En cualquier otro caso, realizar una eliminación de valores faltantes podrá ocasionar sesgos en los análisis y modelos subsecuentes.

Primero observa el número total de observaciones y variables que tiene tu conjunto de datos.

riskfactors_df.shape

(245, 34)

Pairwise deletion (eliminación por pares)#

Esto consiste en ignorar valores nulos para realziar operaciones como calculos de promedio, conteos, etc

riskfactors_df.weight_lbs.size, riskfactors_df.weight_lbs.count()

(245, 235)

En este caso, en size se cuentan todos los datos. Mientras que en count() solo se cuentan los datos no nulos

Listwise Deletion or Complete Case (Eliminación por lista o caso completo)#

Este procedimiento consiste en borrar todo el registro (o fila) si es que hay un valor faltante

Con base en 1 columna#

(
    riskfactors_df
    .dropna(
        subset=["weight_lbs"],
        how="any"
    )
)

	state	sex	age	weight_lbs	height_inch	bmi	marital	pregnant	children	education	...	smoke_100	smoke_days	smoke_stop	smoke_last	diet_fruit	diet_salad	diet_potato	diet_carrot	diet_vegetable	diet_juice
0	26	Female	49	190	64	32.68	Married	NaN	0	6	...	No	NaN	NaN	NaN	1095	261	104	156	521	12
1	40	Female	48	170	68	25.90	Divorced	NaN	0	5	...	No	NaN	NaN	NaN	52	209	52	0	52	0
2	72	Female	55	163	64	28.04	Married	NaN	0	4	...	No	NaN	NaN	NaN	36	156	52	24	24	24
3	42	Male	42	230	74	29.59	Married	NaN	1	6	...	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	32	Female	66	135	62	24.74	Widowed	NaN	0	5	...	Yes	Everyday	Yes	NaN	-7	261	209	261	365	104
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
240	10	Female	79	144	63	25.56	Widowed	NaN	0	4	...	No	NaN	NaN	NaN	-7	-7	-7	-7	-7	-7
241	46	Male	45	170	74	21.87	Divorced	NaN	2	4	...	No	NaN	NaN	NaN	52	52	52	24	52	24
242	15	Male	62	175	71	24.46	Divorced	NaN	0	6	...	Yes	Not@All	NaN	7	365	156	104	52	730	365
243	34	Female	62	138	64	23.74	Married	NaN	0	4	...	Yes	Everyday	No	NaN	730	0	24	156	104	0
244	18	Male	9	200	70	28.76	Married	NaN	0	4	...	Yes	Not@All	NaN	7	52	104	52	0	104	104

235 rows × 34 columns

Se borraron los registros si weight_lbs esta faltando

Con base en 2 o más columnas#

(
    riskfactors_df
    .dropna(
        subset=["weight_lbs", "height_inch"],
        how="any"
    )
)

	state	sex	age	weight_lbs	height_inch	bmi	marital	pregnant	children	education	...	smoke_100	smoke_days	smoke_stop	smoke_last	diet_fruit	diet_salad	diet_potato	diet_carrot	diet_vegetable	diet_juice
0	26	Female	49	190	64	32.68	Married	NaN	0	6	...	No	NaN	NaN	NaN	1095	261	104	156	521	12
1	40	Female	48	170	68	25.90	Divorced	NaN	0	5	...	No	NaN	NaN	NaN	52	209	52	0	52	0
2	72	Female	55	163	64	28.04	Married	NaN	0	4	...	No	NaN	NaN	NaN	36	156	52	24	24	24
3	42	Male	42	230	74	29.59	Married	NaN	1	6	...	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	32	Female	66	135	62	24.74	Widowed	NaN	0	5	...	Yes	Everyday	Yes	NaN	-7	261	209	261	365	104
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
240	10	Female	79	144	63	25.56	Widowed	NaN	0	4	...	No	NaN	NaN	NaN	-7	-7	-7	-7	-7	-7
241	46	Male	45	170	74	21.87	Divorced	NaN	2	4	...	No	NaN	NaN	NaN	52	52	52	24	52	24
242	15	Male	62	175	71	24.46	Divorced	NaN	0	6	...	Yes	Not@All	NaN	7	365	156	104	52	730	365
243	34	Female	62	138	64	23.74	Married	NaN	0	4	...	Yes	Everyday	No	NaN	730	0	24	156	104	0
244	18	Male	9	200	70	28.76	Married	NaN	0	4	...	Yes	Not@All	NaN	7	52	104	52	0	104	104

234 rows × 34 columns

Se borraron los registros si weight_lbs y height_inch estan faltando

(
    riskfactors_df
    .dropna(
        subset=["weight_lbs", "height_inch"],
        how="all"
    )
)

	state	sex	age	weight_lbs	height_inch	bmi	marital	pregnant	children	education	...	smoke_100	smoke_days	smoke_stop	smoke_last	diet_fruit	diet_salad	diet_potato	diet_carrot	diet_vegetable	diet_juice
0	26	Female	49	190	64	32.68	Married	NaN	0	6	...	No	NaN	NaN	NaN	1095	261	104	156	521	12
1	40	Female	48	170	68	25.90	Divorced	NaN	0	5	...	No	NaN	NaN	NaN	52	209	52	0	52	0
2	72	Female	55	163	64	28.04	Married	NaN	0	4	...	No	NaN	NaN	NaN	36	156	52	24	24	24
3	42	Male	42	230	74	29.59	Married	NaN	1	6	...	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	32	Female	66	135	62	24.74	Widowed	NaN	0	5	...	Yes	Everyday	Yes	NaN	-7	261	209	261	365	104
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
240	10	Female	79	144	63	25.56	Widowed	NaN	0	4	...	No	NaN	NaN	NaN	-7	-7	-7	-7	-7	-7
241	46	Male	45	170	74	21.87	Divorced	NaN	2	4	...	No	NaN	NaN	NaN	52	52	52	24	52	24
242	15	Male	62	175	71	24.46	Divorced	NaN	0	6	...	Yes	Not@All	NaN	7	365	156	104	52	730	365
243	34	Female	62	138	64	23.74	Married	NaN	0	4	...	Yes	Everyday	No	NaN	730	0	24	156	104	0
244	18	Male	9	200	70	28.76	Married	NaN	0	4	...	Yes	Not@All	NaN	7	52	104	52	0	104	104

244 rows × 34 columns

Se borraron los registros si weight_lbs o height_inch estan faltando

Representación gráfica tras la eliminación de los valores faltantes#

(
    riskfactors_df
    .dropna(
        subset=["weight_lbs", "height_inch"],
        how="any"
    )
    .select_columns(["weight_lbs", "height_inch"])
    .pipe(missingno.matrix)
)

/root/venv/lib/python3.9/site-packages/missingno/missingno.py:73: MatplotlibDeprecationWarning: The 'b' parameter of grid() has been renamed 'visible' since Matplotlib 3.5; support for the old name will be dropped two minor releases later.
  ax0.grid(b=False)
/root/venv/lib/python3.9/site-packages/missingno/missingno.py:142: MatplotlibDeprecationWarning: The 'b' parameter of grid() has been renamed 'visible' since Matplotlib 3.5; support for the old name will be dropped two minor releases later.
  ax1.grid(b=False)

<AxesSubplot:>

../_images/3a6374bf11ce02213a3c5cea61a964fd187b7ff82264784acca741e6c5ab7f34.png

Se confirma en el grafico que no faltan datos

Imputación básica de valores faltantes#

Imputacion es reemplazar los datos faltantes por datos con sentido, puedes basarte en:

algun estadistico que venga de los datos
un valor por contexto
un valor extraido de un modelo de machine learning

Imputación con base en el contexto#

implicit_to_explicit_df = pd.DataFrame(
    data={
        "name": ["lynn", np.nan, "zelda", np.nan, "shadowsong", np.nan],
        "time": ["morning", "afternoon", "morning", "afternoon", "morning", "afternoon",],
        "value": [350, 310, 320, 350, 310, 320]
    }
)

implicit_to_explicit_df

	name	time	value
0	lynn	morning	350
1	NaN	afternoon	310
2	zelda	morning	320
3	NaN	afternoon	350
4	shadowsong	morning	310
5	NaN	afternoon	320

En este caso, es obvio que hay que poner el mismo nombre de la fila previa en el valor faltante, eso es contexto.

Como se rellena de manera rapida?

implicit_to_explicit_df.ffill()

	name	time	value
0	lynn	morning	350
1	lynn	afternoon	310
2	zelda	morning	320
3	zelda	afternoon	350
4	shadowsong	morning	310
5	shadowsong	afternoon	320

Imputación de un único valor#

(
    riskfactors_df
    .select_columns("weight_lbs", "height_inch", "bmi")
    .missing.bind_shadow_matrix(true_string = True, false_string = False)
    .apply(
        axis = "rows",
        func = lambda column : column.fillna(column.mean()) if "_NA" not in column.name else column
        # Basicamente se acaba de llenar el valor faltante con el promedio de su respectiva columna
    )
)

	weight_lbs	height_inch	bmi	weight_lbs_NA	height_inch_NA	bmi_NA
0	190.0	64.0	32.68	False	False	False
1	170.0	68.0	25.90	False	False	False
2	163.0	64.0	28.04	False	False	False
3	230.0	74.0	29.59	False	False	False
4	135.0	62.0	24.74	False	False	False
...	...	...	...	...	...	...
240	144.0	63.0	25.56	False	False	False
241	170.0	74.0	21.87	False	False	False
242	175.0	71.0	24.46	False	False	False
243	138.0	64.0	23.74	False	False	False
244	200.0	70.0	28.76	False	False	False

245 rows × 6 columns

Basicamente se acaba de llenar el valor faltante con el promedio de su respectiva columna

visualizando ese procedimiento:

(
    riskfactors_df
    .select_columns("weight_lbs", "height_inch", "bmi")
    .missing.bind_shadow_matrix(true_string = True, false_string = False)
    .apply(
        axis = "rows",
        func = lambda column : column.fillna(column.mean()) if "_NA" not in column.name else column
        # Basicamente se acaba de llenar el valor faltante con el promedio de su respectiva columna
    )
    .pipe(
        lambda df: (
            sns.displot(
                data=df,
                x="weight_lbs",
                hue="weight_lbs_NA"
            )
        )
    )
)

<seaborn.axisgrid.FacetGrid at 0x7f3faf6c0ca0>

../_images/2dcaafcf3551529f5e0a3892e5174b4485df8f0ce99adf0d8fd2fea53e2e5857.png

Se confirma que los valores faltantes ahora corresponden al promedio.

Scatterplot con valores imputados#

(
    riskfactors_df
    .select_columns("weight_lbs", "height_inch", "bmi")
    .missing.bind_shadow_matrix(true_string = True, false_string = False)
    .apply(
        axis = "rows",
        func = lambda column : column.fillna(column.mean()) if "_NA" not in column.name else column
    )
    .assign(
        imputed=lambda df: df.weight_lbs_NA | df.height_inch_NA
    )
    .pipe(
        lambda df: (
            sns.scatterplot(
                data=df,
                x="weight_lbs",
                y="height_inch",
                hue="imputed"
            )
        )
    )
)

<AxesSubplot:xlabel='weight_lbs', ylabel='height_inch'>

../_images/d8a6d9e5760ff53f1242198a38a145bb39419e2aa78ae09125b6835489cd1125.png

(
    riskfactors_df
    .select_columns("weight_lbs", "height_inch", "bmi")
    .missing.bind_shadow_matrix(true_string = True, false_string = False)
    .apply(
        axis = "rows",
        func = lambda column : column.fillna(column.mean()) if "_NA" not in column.name else column
    )
    .pivot_longer(
        index="*_NA"
    )
    .pivot_longer(
        index=["variable", "value"],
        names_to = "variable_NA",
        values_to = "value_NA"
    )
    .assign(
        valid=lambda df:df.apply(axis="columns", func = lambda column: column.variable in column.variable_NA)
    )
    .query("valid")
    .pipe(
        lambda df: (
            sns.displot(
                data=df,
                x="value",
                hue="value_NA",
                col="variable",
                common_bins=False,
                facet_kws={
                    "sharex": False,
                    "sharey": False
                }
            )
        )
    )
)

<seaborn.axisgrid.FacetGrid at 0x7f3fbf26bd60>

../_images/305984dc76bf9bac4afd8aab851db18fc06675aa1a1d9b44416aef00dc1eb611.png

Created in Deepnote

Manejo de Datos Faltantes: Detección

Contents

Manejo de Datos Faltantes: Detección#

Preparacion de entorno#

Importar funciones personalizadas#

Cargar los conjuntos de datos#

Pima Indians Diabetes#

naniar (oceanbuoys, pedestrian, riskfactors)#

Crear unidades de información de los conjuntos de datos#

Descargar y cargar los conjuntos de datos#

Incluir conjuntos de datos en nuestro ambiente local#

Verificar carga#

Tabulación de valores faltantes#

Número total de valores completos (sin observaciones faltantes)#

Número total de valores faltantes#

Resúmenes tabulares de valores faltantes#

Variables / Columnas#

Resumen por variable#

Tabulación del resumen por variable#

Casos / Observaciones / Filas#

Resúmenes por caso (muestra valores nulos por row)#

Tabulación del resumen por caso#

Intervalos de valores faltantes#

Run length de valores faltantes#

Visualización inicial de valores faltantes#

Variable#

Casos / Observaciones / Filas#

Codificación de valores faltantes#

Valores comúnmente asociados a valores faltantes#

Cadenas de texto#

Números#

¿Cómo encontrar los valores comúnmente asociados a valores faltantes?#

Sustituyendo valores comúnmente asociados a valores faltantes#

Sustitución desde la lectura de datos#

Sustitución global#

Sustitución dirigida#

Conversión de valores faltantes implícitos a explícitos#

Estrategias para la identificación de valores faltantes implícitos#

Pivotar la tabla de datos#

Cuantificar ocurrencias de n-tuplas#

Exponer filas faltantes implícitas a explícitas#

Exponer n-tuplas de valores faltantes#

Limitar la exposición de n-tuplas de valores faltantes#

Rellenar los valores faltantes#

Limitar el rellenado de valores faltantes implícitos#

Tipos de valores faltantes#

Missing Completely At Random (MCAR)#

Missing At Random (MAR)#

Missing Not At Random (MNAR)#

Concepto y aplicación de la matriz de sombras (i.e., shadow matrix)#

Construcción de la matriz de sombras#

Utilizar función de utilería bind_shadow_matrix()#

Explorar estadísticos utilizando las nuevas columnas de la matriz de sombras#

Visualización de valores faltantes en una variable#

Visualización de valores faltantes en dos variables#

Correlación de nulidad#

Eliminación de valores faltantes#

Pairwise deletion (eliminación por pares)#

Listwise Deletion or Complete Case (Eliminación por lista o caso completo)#

Con base en 1 columna#

Con base en 2 o más columnas#

Representación gráfica tras la eliminación de los valores faltantes#

Imputación básica de valores faltantes#

Imputación con base en el contexto#

Imputación de un único valor#

Scatterplot con valores imputados#

Utilizar función de utilería `bind_shadow_matrix()`#