How to Clean Time Series Data in Python

Source: IT Builder News Category: Edge Computing Date: 2026-06-16 17:21:31

Real-world time series data is rarely clean. Sensors drop out, system clocks drift, pipelines duplicate records, and manual data entry introduces mistakes. By the time a dataset reaches your notebook, it has passed through collection, transmission, and storage, each step a potential source of corruption.

Cleaning time series data is harder than cleaning tabular data because time is a structural constraint. You can't shuffle rows or impute a missing value with a column mean without pulling future data into a past observation. Every cleaning decision has to respect temporal ordering, or it breaks the integrity of everything built on top of it.
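To make the leakage concrete, here is a minimal sketch on a small made-up series (the values and variable names are illustrative, not part of the sample data used later). A column mean is computed from every observation, including those recorded after the gap, while a forward fill only uses what was known at the time.

```python
import numpy as np
import pandas as pd

# Hypothetical upward-trending series with one missing reading early on
idx = pd.date_range("2024-06-01", periods=10, freq="h")
values = pd.Series(np.arange(10, dtype=float) * 10.0, index=idx)
values.iloc[2] = np.nan

# Global mean uses every observation, including those after the gap
global_mean_fill = values.fillna(values.mean())

# Causal alternative: only observations before the gap are used
causal_fill = values.ffill()

print(f"Global mean fill: {global_mean_fill.iloc[2]:.2f}")
print(f"Forward fill:     {causal_fill.iloc[2]:.2f}")
```

The mean-filled value sits far above anything observed before the gap, because it is dominated by later readings.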

This guide walks through the full cleaning pipeline in Python: from raw data arrival to a dataset ready for feature engineering or modelling. We'll cover missing value detection and imputation, outlier identification and treatment, duplicate handling, frequency alignment, noise smoothing, and schema validation, applied to sample sensor data throughout.

You can get the Colab notebook from GitHub and follow along.

Prerequisites

To follow along with this guide, you should be:

Comfortable working with Python and pandas DataFrames
Familiar with time-indexed data
Aware of what feature engineering and machine learning modelling involve at a high level

We'll use pandas and numpy for data manipulation, scipy for signal smoothing and statistical tests, scikit-learn for anomaly detection, and statsmodels for seasonal decomposition. Install them before running any code in this guide:

pip install pandas numpy scipy scikit-learn statsmodels

How to Audit Your Time Series Before Cleaning It
How to Reindex to a Canonical Frequency
How to Handle Missing Values
- Forward Fill — For Step-Function Signals
- Time-Weighted Interpolation — For Continuous Signals
- Seasonal Decomposition Imputation — For Long Gaps
How to Detect and Handle Outliers
- Z-Score with Rolling Window
- IQR-Based Outlier Detection
- Isolation Forest — For Multivariate Outlier Detection
- Outlier Treatment
How to Remove Duplicates
Frequency Alignment and Resampling
Smoothing Noise
- Exponential Weighted Moving Average
- Savitzky-Golay Filter
Schema and Sanity Validation
The Complete Cleaning Checklist

How to Audit Your Time Series Before Cleaning It

The first rule of data cleaning is: look before you cut. Before imputing, smoothing, or dropping anything, you need a complete picture of what's wrong and where.

A good audit covers the following:

The time index: Is it regular? Are there gaps?
Missing value distribution: Are missing values random or clustered?
Value range: Are there obvious spikes, dips, or sensor failures?
Duplicate timestamps

Let's spin up a sample dataset (with some of the above problems):

import numpy as np
import pandas as pd

# Simulate one week of smart grid voltage readings (hourly)
# with realistic problems injected
periods = 168
index = pd.date_range("2024-06-01", periods=periods, freq="h")
voltage = (
    230.0
    + 3.5 * np.sin(2 * np.pi * np.arange(periods) / 24)
    + np.random.normal(0, 1.2, periods)
)

# Inject problems
voltage[14:17] = np.nan          # sensor dropout: 3 consecutive missing
voltage[42] = np.nan             # isolated missing
voltage[78] = 312.4              # spike outlier
voltage[101:104] = np.nan        # another dropout
voltage[130] = 187.2             # dip outlier

series = pd.Series(voltage, index=index, name="voltage_v")

# --- Audit ---
print("=== TIME SERIES AUDIT ===")
print(f"Period:        {series.index.min()} → {series.index.max()}")
print(f"Observations:  {len(series)}")
print(f"Expected freq: {pd.infer_freq(series.index)}")
print(f"\nMissing values: {series.isna().sum()} ({series.isna().mean()*100:.1f}%)")
print(f"Value range:    [{series.min():.2f}, {series.max():.2f}]")
print(f"Mean ± Std:     {series.mean():.2f} ± {series.std():.2f}")

# Identify consecutive missing runs
missing_mask = series.isna()
missing_runs = []
run_start = None
for i, (ts, is_missing) in enumerate(missing_mask.items()):
    if is_missing and run_start is None:
        run_start = ts
    elif not is_missing and run_start is not None:
        missing_runs.append((run_start, missing_mask.index[i - 1]))
        run_start = None
if run_start is not None:  # close a run that extends to the end of the series
    missing_runs.append((run_start, missing_mask.index[-1]))

print(f"\nMissing runs ({len(missing_runs)} total):")
for start, end in missing_runs:
    print(f"  {start} → {end}")

Output:

=== TIME SERIES AUDIT ===
Period:        2024-06-01 00:00:00 → 2024-06-07 23:00:00
Observations:  168
Expected freq: h

Missing values: 7 (4.2%)
Value range:    [187.20, 312.40]
Mean ± Std:     230.22 ± 7.81

Missing runs (3 total):
  2024-06-01 14:00:00 → 2024-06-01 16:00:00
  2024-06-02 18:00:00 → 2024-06-02 18:00:00
  2024-06-05 05:00:00 → 2024-06-05 07:00:00

This audit gives you a map of the damage before you start cleaning. The key task is distinguishing between isolated missing values, which are imputable with local context, and long missing runs, which may need a different strategy or flagging for downstream consumers.
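The same run detection can be written without an explicit loop, which also yields each run's length so that isolated gaps and longer runs can be separated. This is a sketch on a small made-up series; the helper name `missing_run_table` and the one-step threshold for "isolated" are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def missing_run_table(s: pd.Series) -> pd.DataFrame:
    """Return one row per consecutive run of NaNs: start, end, length."""
    is_missing = s.isna()
    # A new run id starts whenever the missing/present state flips
    run_id = is_missing.astype(int).diff().ne(0).cumsum()
    rows = [
        {"start": grp.index[0], "end": grp.index[-1], "length": len(grp)}
        for _, grp in s[is_missing].groupby(run_id[is_missing])
    ]
    return pd.DataFrame(rows, columns=["start", "end", "length"])

demo = pd.Series(
    [1.0, np.nan, np.nan, np.nan, 2.0, np.nan, 3.0, 4.0],
    index=pd.date_range("2024-06-01", periods=8, freq="h"),
)
runs = missing_run_table(demo)
# Assumed threshold: a single missing step counts as "isolated"
runs["kind"] = np.where(runs["length"] <= 1, "isolated", "run")
print(runs)
```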

How to Reindex to a Canonical Frequency

Before imputing missing values, you need to confirm your time index is actually regular. A common problem in ingested time series is that missing timestamps are simply absent rather than represented as NaN rows, which means a .fillna() call will never find them.

# Simulate a sensor feed with missing timestamps (not just missing values)
irregular_index = index.delete([14, 15, 16, 42, 101, 102, 103])
irregular_series = series.dropna().reindex(irregular_index)

print(f"Original length:   {len(series)}")
print(f"Irregular length:  {len(irregular_series)}")
print(f"Inferred freq:     {pd.infer_freq(irregular_series.index)}")  # None = irregular

# Reindex to the full canonical hourly grid
canonical_index = pd.date_range(
    start=irregular_series.index.min(),
    end=irregular_series.index.max(),
    freq="h"
)
reindexed = irregular_series.reindex(canonical_index)

print(f"\nAfter reindex:")
print(f"Length:         {len(reindexed)}")
print(f"Missing values: {reindexed.isna().sum()}")
print(f"Inferred freq:  {pd.infer_freq(reindexed.index)}")

Output:

Original length:   168
Irregular length:  161
Inferred freq:     None

After reindex:
Length:         168
Missing values: 7
Inferred freq:  h

pd.infer_freq returning None is your signal that the index is irregular, here because of gaps. After reindexing to the canonical grid, missing timestamps become explicit NaN rows, and your imputation logic can find them.
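Before reindexing, it can help to see exactly which timestamps are absent. The sketch below shows two ways to locate them on a small hypothetical feed, assuming a known one-hour sampling interval: a set difference against the canonical grid, and a check for steps larger than the expected interval.

```python
import pandas as pd

expected_step = pd.Timedelta(hours=1)  # assumed sampling interval
full = pd.date_range("2024-06-01", periods=12, freq="h")
observed = full.delete([4, 5, 9])      # hypothetical feed with absent rows

# Approach 1: set difference against the canonical grid
canonical = pd.date_range(observed.min(), observed.max(), freq="h")
absent = canonical.difference(observed)

# Approach 2: find steps larger than the expected interval.
# Each entry is labelled with the first timestamp after a gap.
steps = observed.to_series().diff()
gap_ends = steps[steps > expected_step]

print("Absent timestamps:", list(absent.strftime("%H:%M")))
print(gap_ends)
```

The first approach lists every absent timestamp; the second is cheaper on long feeds and reports how wide each gap is.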

How to Handle Missing Values

Not all missing values should be handled the same way. A single isolated missing reading in a smooth signal is best filled with interpolation. A 3-hour sensor dropout in a volatile signal, however, might be better flagged than fabricated. Strategy should match both gap length and signal behavior.
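One way to encode "fill short gaps, flag long ones" is sketched below. The helper name `fill_short_gaps`, the `max_gap` parameter, and the demo values are assumptions for illustration; the right threshold depends on how volatile your signal is.

```python
import numpy as np
import pandas as pd

def fill_short_gaps(s: pd.Series, max_gap: int = 2) -> pd.DataFrame:
    """Interpolate NaN runs of length <= max_gap; leave longer runs as NaN."""
    is_missing = s.isna()
    run_id = is_missing.astype(int).diff().ne(0).cumsum()
    # Length of the NaN run each position belongs to (0 for observed values)
    run_len = is_missing.groupby(run_id).transform("sum")
    fillable = is_missing & (run_len <= max_gap)

    # limit_area="inside" avoids extrapolating past the first/last observation
    interpolated = s.interpolate(method="time", limit_area="inside")
    filled = s.where(~fillable, interpolated)
    return pd.DataFrame({
        "value": filled,
        "was_filled": fillable & filled.notna(),
        "still_missing": filled.isna(),
    })

demo = pd.Series(
    [10.0, np.nan, 12.0, np.nan, np.nan, np.nan, 16.0, 17.0],
    index=pd.date_range("2024-06-01", periods=8, freq="h"),
)
result = fill_short_gaps(demo, max_gap=2)
print(result)
```

The single missing reading is interpolated, while the three-step dropout stays NaN and is marked in `still_missing` for downstream consumers.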

Forward Fill — For Step-Function Signals

Forward fill is appropriate when the variable holds its last known value until something changes it — a machine state, a setpoint, a categorical flag.

# Equipment operating mode: a step signal
mode_data = pd.Series(
    ["running", "running", np.nan, np.nan, "idle", "idle", np.nan, "running"],
    index=pd.date_range("2024-06-01", periods=8, freq="h"),
    name="operating_mode"
)
filled_mode = mode_data.ffill()
print(pd.DataFrame({"original": mode_data, "ffill": filled_mode}))

Output:

                    original    ffill
2024-06-01 00:00:00  running  running
2024-06-01 01:00:00  running  running
2024-06-01 02:00:00      NaN  running
2024-06-01 03:00:00      NaN  running
2024-06-01 04:00:00     idle     idle
2024-06-01 05:00:00     idle     idle
2024-06-01 06:00:00      NaN     idle
2024-06-01 07:00:00  running  running
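An unbounded forward fill will carry a stale state across an arbitrarily long outage. If that is a concern, `ffill` accepts a `limit` argument. The sketch below assumes a staleness budget of two steps, which is an illustrative choice rather than a recommendation.

```python
import numpy as np
import pandas as pd

state = pd.Series(
    ["running", np.nan, np.nan, np.nan, np.nan, "idle"],
    index=pd.date_range("2024-06-01", periods=6, freq="h"),
    name="operating_mode",
)

# Carry the last known state forward for at most 2 steps (assumed budget);
# anything beyond that stays NaN so it can be flagged instead of guessed
bounded = state.ffill(limit=2)
print(bounded)
```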

Time-Weighted Interpolation — For Continuous Signals

For continuous sensor readings, linear interpolation weighted by time handles irregular gaps correctly because it doesn't assume equal spacing.

# Fill the voltage series using time-based interpolation
voltage_clean = reindexed.interpolate(method="time")

# Compare original vs filled around the first gap
gap_window = voltage_clean["2024-06-01 12:00":"2024-06-01 18:00"]
original_window = reindexed["2024-06-01 12:00":"2024-06-01 18:00"]

comparison = pd.DataFrame({
    "original":     original_window,
    "interpolated": gap_window.round(3),
    "was_missing":  original_window.isna(),
})
print(comparison)

Output:

                       original  interpolated  was_missing
2024-06-01 12:00:00  230.290355       230.290        False
2024-06-01 13:00:00  226.798197       226.798        False
2024-06-01 14:00:00         NaN       226.848         True
2024-06-01 15:00:00         NaN       226.897         True
2024-06-01 16:00:00         NaN       226.947         True
2024-06-01 17:00:00  226.996356       226.996        False
2024-06-01 18:00:00  225.410371       225.410        False
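On a regular hourly grid, method="time" and method="linear" give the same answer. The difference only shows up when spacing is uneven, as in this small made-up example where the missing reading sits one hour after the previous observation and three hours before the next.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime([
    "2024-06-01 00:00", "2024-06-01 01:00", "2024-06-01 04:00",
])
s = pd.Series([10.0, np.nan, 18.0], index=idx)

by_position = s.interpolate(method="linear")  # treats rows as equally spaced
by_time = s.interpolate(method="time")        # weights by elapsed time

print(f"linear: {by_position.iloc[1]:.1f}")
print(f"time:   {by_time.iloc[1]:.1f}")
```

Position-based interpolation places the value halfway between its neighbours, while time-based interpolation places it a quarter of the way, matching the elapsed time.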

Seasonal Decomposition Imputation — For Long Gaps

For gaps longer than a few steps in a seasonal signal, interpolating across the gap ignores the seasonal pattern. A better approach is to decompose the series, impute each component separately, then reconstruct.

from statsmodels.tsa.seasonal import seasonal_decompose

# Use a longer series for decomposition (needs enough periods)
long_voltage = pd.Series(
    230.0
    + 3.5 * np.sin(2 * np.pi * np.arange(336) / 24)
    + np.random.normal(0, 1.0, 336),
    index=pd.date_range("2024-06-01", periods=336, freq="h")
)

# Inject a 6-hour gap
long_voltage.iloc[100:106] = np.nan

# Interpolate first to give decompose a complete series to work with
temp_filled = long_voltage.interpolate(method="time")
decomp = seasonal_decompose(temp_filled, model="additive", period=24)

# Reconstruct: trend + seasonal + zero residual for missing positions
reconstructed = long_voltage.copy()
missing_idx = long_voltage[long_voltage.isna()].index
reconstructed[missing_idx] = (
    decomp.trend[missing_idx].ffill()
    + decomp.seasonal[missing_idx]
)

print(f"Missing before: {long_voltage.isna().sum()}")
print(f"Missing after:  {reconstructed.isna().sum()}")
print("\nFilled values at gap:")
print(reconstructed[missing_idx].round(3))

Output (the six filled values depend on the random noise draw, so only the counts are shown):

Missing before: 6
Missing after:  0

The seasonal decomposition imputation respects the time-of-day pattern. Because each filled value is the trend plus the seasonal component for that hour, the filled values follow the expected daily curve rather than a straight line across the gap.

How to Detect and Handle Outliers

Outliers in time series are trickier than in tabular data because context matters. For example, an unusually high or low voltage might be a sensor spike or a genuine grid event. You need methods that use temporal context, not just global statistics.

Z-Score with Rolling Window

A global Z-score misses local anomalies in non-stationary series. A rolling Z-score flags values that are unusual relative to their local neighbourhood.

Note: A non-stationary series is a time series whose statistical properties (such as mean, variance, or trend) change over time instead of remaining constant.

window = 24  # 24-hour rolling window
roll_mean = voltage_clean.rolling(window, center=True, min_periods=1).mean()
roll_std  = voltage_clean.rolling(window, center=True, min_periods=1).std()
rolling_z = (voltage_clean - roll_mean) / roll_std

threshold = 3.0
outliers_z = rolling_z[rolling_z.abs() > threshold]
print(f"Rolling Z-score outliers detected: {len(outliers_z)}")
print(outliers_z.round(3))

Output:

Rolling Z-score outliers detected: 2
2024-06-04 06:00:00    4.646
2024-06-06 10:00:00   -4.484
Name: voltage_v, dtype: float64

Z-score outlier detection works best for approximately Gaussian (normal) distributions because it assumes the data is centered around a mean with symmetric spread measured by standard deviation.
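To see why the rolling version matters for non-stationary data, consider a made-up signal that rises slowly and contains one local spike. The slope, spike size, and window below are illustrative assumptions. The spike is close to the global mean, so a global Z-score barely registers it, while the rolling Z-score flags it.

```python
import numpy as np
import pandas as pd

# Hypothetical slowly rising signal with one local spike at position 50
idx = pd.date_range("2024-06-01", periods=200, freq="h")
signal = pd.Series(0.1 * np.arange(200), index=idx)
signal.iloc[50] += 8.0

# Global Z-score: one mean and one std for the whole series
global_z = (signal - signal.mean()) / signal.std()

# Rolling Z-score: mean and std from a centered 24-hour neighbourhood
roll = signal.rolling(24, center=True, min_periods=12)
rolling_z = (signal - roll.mean()) / roll.std()

spike = idx[50]
flagged = rolling_z[rolling_z.abs() > 3.0]
print(f"Global Z at spike:  {global_z[spike]:.2f}")
print(f"Rolling Z at spike: {rolling_z[spike]:.2f}")
print(f"Flagged by rolling Z: {list(flagged.index)}")
```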

IQR-Based Outlier Detection

The interquartile range (IQR) method is more robust for detecting outliers in non-Gaussian distributions. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1), representing the spread of the middle 50% of the data. Values more than 1.5 × IQR below Q1 or above Q3 are treated as outliers.

Q1 = voltage_clean.quantile(0.25)
Q3 = voltage_clean.quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers_iqr = voltage_clean[
    (voltage_clean < lower_bound) | (voltage_clean > upper_bound)
]
print(f"IQR bounds: [{lower_bound:.2f}, {upper_bound:.2f}]")
print(f"Outliers detected: {len(outliers_iqr)}")
print(outliers_iqr.round(2))

Output:

IQR bounds: [220.16, 239.46]
Outliers detected: 2
2024-06-04 06:00:00    312.4
2024-06-06 10:00:00    187.2
Name: voltage_v, dtype: float64

Isolation Forest — For Multivariate Outlier Detection

When you have multiple sensors, an isolated reading on one channel might look normal, but its combination with readings from other channels reveals the anomaly. Isolation Forest handles this naturally.

from sklearn.ensemble import IsolationForest

# Build a multi-sensor DataFrame
np.random.seed(42)
n = 200
sensor_df = pd.DataFrame({
    "voltage_v":    230 + 3 * np.sin(2 * np.pi * np.arange(n) / 24) + np.random.normal(0, 1, n),
    "current_a":    15  + 0.8 * np.sin(2 * np.pi * np.arange(n) / 24) + np.random.normal(0, 0.3, n),
    "frequency_hz": 50  + np.random.normal(0, 0.05, n),
}, index=pd.date_range("2024-06-01", periods=n, freq="h"))

# Inject a multivariate anomaly: voltage drops and current spikes together
sensor_df.iloc[88, 0] = 194.2   # voltage dip
sensor_df.iloc[88, 1] = 28.7    # current surge (consistent with fault)

clf = IsolationForest(contamination=0.02, random_state=42)
# fit_predict returns labels, not scores: -1 = anomaly, 1 = normal
sensor_df["anomaly_score"] = clf.fit_predict(sensor_df[["voltage_v", "current_a", "frequency_hz"]])

anomalies = sensor_df[sensor_df["anomaly_score"] == -1]
print(f"Anomalies detected: {len(anomalies)}")
print(anomalies[["voltage_v", "current_a", "frequency_hz"]].round(2))

Output:

Anomalies detected: 4
                     voltage_v  current_a  frequency_hz
2024-06-02 07:00:00     234.75      15.84         49.90
2024-06-04 06:00:00     233.09      15.82         50.15
2024-06-04 16:00:00     194.20      28.70         50.08
2024-06-06 05:00:00     235.09      15.41         49.91

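Note that contamination=0.02 tells the model to flag roughly 2% of the 200 rows, so four rows are returned even though only one anomaly was injected; the other three are ordinary readings. In practice you'd follow up anomaly labels with domain-specific threshold rules.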

Outlier Treatment

Once outliers are identified, you can handle them in several ways:

Cap them using Winsorization by limiting extreme values to a threshold.
Replace them with interpolated or estimated values.
Flag them so the model can handle them appropriately.

# Winsorize: cap at the IQR bounds
voltage_winsorized = voltage_clean.clip(lower=lower_bound, upper=upper_bound)

# Replace outliers with time-interpolated values
voltage_outlier_fixed = voltage_clean.copy()
voltage_outlier_fixed[outliers_iqr.index] = np.nan
voltage_outlier_fixed = voltage_outlier_fixed.interpolate(method="time")

print("Outlier treatment comparison:")
for ts in outliers_iqr.index:
    print(f"\n  {ts}")
    print(f"    Original:     {voltage_clean[ts]:.2f}")
    print(f"    Winsorized:   {voltage_winsorized[ts]:.2f}")
    print(f"    Interpolated: {voltage_outlier_fixed[ts]:.2f}")

Output:

Outlier treatment comparison:

  2024-06-04 06:00:00
    Original:     312.40
    Winsorized:   239.46
    Interpolated: 232.01

  2024-06-06 10:00:00
    Original:     187.20
    Winsorized:   220.16
    Interpolated: 231.43

Winsorization preserves the point but clips it to a plausible range — useful when you want to retain the information that something anomalous happened. Interpolation treats the outlier as if it were missing — better when you believe the reading is simply wrong.
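The third option, flagging, can be combined with either treatment. A minimal sketch, assuming fixed plausibility limits in place of the IQR bounds and a hypothetical column name `voltage_was_outlier`: the value is repaired, and an indicator column records that it was.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-06-01", periods=6, freq="h")
raw = pd.Series([230.1, 229.8, 312.4, 230.4, 231.0, 230.6],
                index=idx, name="voltage_v")

# Assumed plausibility limits; the IQR bounds would serve the same purpose
lower, upper = 220.0, 240.0
is_outlier = (raw < lower) | (raw > upper)

# Repair the value, but keep a record that it was repaired
cleaned = raw.mask(is_outlier).interpolate(method="time")
features = pd.DataFrame({
    "voltage_v": cleaned,
    "voltage_was_outlier": is_outlier.astype(int),  # indicator a model can use
})
print(features)
```

Keeping the indicator lets a downstream model or analyst treat repaired points differently from measured ones.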

How to Remove Duplicates

Duplicate timestamps are common when data pipelines retry on failure. Unlike tabular duplicates, time series duplicates aren't always identical: a retry might deliver a slightly different reading for the same timestamp.

# Inject duplicate timestamps with slightly different values (retry scenario).
# Each duplicate is inserted directly after its original, so arrival order is preserved.
dup_index = index.tolist()
dup_values = voltage_clean.tolist()

# Insert at the later position first so the earlier insert doesn't shift it
dup_index.insert(56, index[55])                        # retry duplicate
dup_values.insert(56, voltage_clean.iloc[55] + 0.7)    # slightly different value
dup_index.insert(21, index[20])                        # exact duplicate timestamp
dup_values.insert(21, voltage_clean.iloc[20])

dup_series = pd.Series(dup_values, index=pd.DatetimeIndex(dup_index), name="voltage_v")
print(f"Length with duplicates: {len(dup_series)}")
print(f"Duplicate timestamps:   {dup_series.index.duplicated().sum()}")

# Strategy 1: keep first (original reading)
dedup_first = dup_series[~dup_series.index.duplicated(keep="first")]

# Strategy 2: keep mean (average across retries)
dedup_mean = dup_series.groupby(level=0).mean()

print(f"\nAfter dedup (keep first): {len(dedup_first)}")
print(f"After dedup (mean):       {len(dedup_mean)}")

# Show the retry duplicate
ts_retry = index[55]
print(f"\nRetry duplicate at {ts_retry}:")
print(f"  Values:      {dup_series[ts_retry].values.round(3)}")
print(f"  Keep first:  {dedup_first[ts_retry]:.3f}")
print(f"  Mean:        {dedup_mean[ts_retry]:.3f}")

Output:

Length with duplicates: 170
Duplicate timestamps:   2

After dedup (keep first): 168
After dedup (mean):       168

Retry duplicate at 2024-06-03 07:00:00:
  Values:      [234.498 235.198]
  Keep first:  234.498
  Mean:        234.848

For most sensor pipelines, keep-first is the right default; the first delivery is the original reading. Mean makes sense when retries come from independent sensors measuring the same quantity.
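Before choosing a strategy, it can be useful to know whether duplicates agree with each other. The sketch below separates exact duplicates from conflicting ones on a small made-up feed; the 0.05 V tolerance is an assumption, not a recommended value.

```python
import pandas as pd

ts = pd.to_datetime([
    "2024-06-01 00:00", "2024-06-01 01:00", "2024-06-01 01:00",
    "2024-06-01 02:00", "2024-06-01 02:00",
])
feed = pd.Series([230.0, 231.0, 231.0, 229.5, 230.2], index=ts, name="voltage_v")

tolerance = 0.05  # assumed measurement tolerance in volts
grouped = feed.groupby(level=0)
counts = grouped.size()
spread = grouped.max() - grouped.min()  # disagreement between deliveries

exact_dups = counts[(counts > 1) & (spread <= tolerance)].index
conflicts = counts[(counts > 1) & (spread > tolerance)].index

print("Exact duplicates:      ", list(exact_dups))
print("Conflicting duplicates:", list(conflicts))
```

Exact duplicates can be dropped safely with either strategy; conflicting ones are worth logging, since they indicate the retry path is not idempotent.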

Frequency Alignment and Resampling

Real pipelines often mix data at different frequencies. For example, you may need a 1-minute meter reading merged with an hourly weather feed. Before joining them, you need to align frequencies explicitly.

# 1-minute power draw readings: higher load during working hours (08:00-18:59)
minute_index = pd.date_range("2024-06-01", periods=1440, freq="min")
power_1min = pd.Series(
    42 + 18 * minute_index.hour.isin(range(8, 19)).astype(int)
    + np.random.normal(0, 2, 1440),
    index=minute_index,
    name="power_kw"
)

# Downsample to hourly: mean is appropriate for power (average over the hour)
power_hourly_mean = power_1min.resample("h").mean().round(2)

# Downsample to hourly: max (peak demand within the hour)
power_hourly_max = power_1min.resample("h").max().round(2)

# Downsample to hourly: sum (total energy = kWh)
energy_hourly_kwh = (power_1min.resample("h").sum() / 60).round(3)

comparison = pd.DataFrame({
    "mean_kw":    power_hourly_mean,
    "peak_kw":    power_hourly_max,
    "energy_kwh": energy_hourly_kwh,
}).iloc[7:13]
print(comparison)

Output:

                     mean_kw  peak_kw  energy_kwh
2024-06-01 07:00:00    42.13    46.28      42.133
2024-06-01 08:00:00    60.56    64.81      60.557
2024-06-01 09:00:00    59.91    64.88      59.912
2024-06-01 10:00:00    60.07    65.16      60.066
2024-06-01 11:00:00    60.08    64.99      60.083
2024-06-01 12:00:00    59.72    63.65      59.724

Which aggregation you choose matters enormously for downstream use. Mean power is right for load profiling. Peak power is right for capacity planning. Sum (converted to kWh) is right for billing. The right answer is domain-specific rather than technical.
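Once the aggregation is chosen, the join itself can go in either direction. The sketch below uses two small made-up feeds (the column names `power_kw` and `temp_c` are illustrative) and shows both options: downsampling the meter to the weather frequency, or keeping minute resolution and attaching the most recent weather reading with `pd.merge_asof`.

```python
import numpy as np
import pandas as pd

meter = pd.DataFrame({
    "timestamp": pd.date_range("2024-06-01 00:00", periods=180, freq="min"),
    "power_kw": np.linspace(40.0, 60.0, 180),
})
weather = pd.DataFrame({
    "timestamp": pd.date_range("2024-06-01 00:00", periods=3, freq="h"),
    "temp_c": [18.0, 19.5, 21.0],
})

# Option A: bring the meter down to the weather frequency, then join
meter_hourly = (meter.set_index("timestamp")["power_kw"]
                .resample("h").mean().rename("mean_kw"))
hourly = weather.set_index("timestamp").join(meter_hourly)

# Option B: keep 1-minute resolution and attach the latest weather reading.
# direction="backward" only matches weather rows at or before each meter row,
# so no future information is pulled into a past observation.
minute = pd.merge_asof(meter, weather, on="timestamp", direction="backward")

print(hourly)
print(minute.iloc[[0, 59, 60, 179]])
```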

Smoothing Noise

Raw sensor data often contains high-frequency noise that obscures the underlying signal. Smoothing before feature engineering prevents the model from fitting to noise, but over-smoothing destroys real variation.

Exponential Weighted Moving Average

An exponentially weighted moving average (EWMA) gives more weight to recent observations and adapts quickly to level changes. This makes it a better fit than a simple moving average for non-stationary signals.

# Noisy temperature sensor (°C)
temp_noisy = pd.Series(
    3.5
    + 1.2 * np.sin(2 * np.pi * np.arange(168) / 24)
    + np.random.normal(0, 0.8, 168),  # high noise
    index=pd.date_range("2024-06-01", periods=168, freq="h"),
    name="temperature_c"
)

temp_ewma = temp_noisy.ewm(span=6, adjust=False).mean()
temp_sma  = temp_noisy.rolling(window=6, center=True).mean()

comparison = pd.DataFrame({
    "raw":  temp_noisy,
    "ewma": temp_ewma.round(3),
    "sma":  temp_sma.round(3),
}).iloc[22:30]
print(comparison)

Output:

                          raw   ewma    sma
2024-06-01 22:00:00  3.212372  2.843  3.035
2024-06-01 23:00:00  3.106840  2.918  3.176
2024-06-02 00:00:00  3.712290  3.145  3.011
2024-06-02 01:00:00  3.344376  3.202  3.294
2024-06-02 02:00:00  2.148946  2.901  3.705
2024-06-02 03:00:00  4.241105  3.284  4.087
2024-06-02 04:00:00  5.677429  3.968  4.381
2024-06-02 05:00:00  5.400083  4.377  4.765
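For intuition about what ewm(span=..., adjust=False) computes, the recurrence can be written out by hand. This sketch uses a short made-up series and checks the manual result against pandas. With adjust=False, the smoothing factor is alpha = 2 / (span + 1), and each output blends the new observation with the previous output.

```python
import numpy as np
import pandas as pd

x = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0])
span = 3
alpha = 2 / (span + 1)  # pandas' span-to-alpha conversion

# With adjust=False: y[0] = x[0], y[t] = alpha * x[t] + (1 - alpha) * y[t-1]
manual = [x.iloc[0]]
for value in x.iloc[1:]:
    manual.append(alpha * value + (1 - alpha) * manual[-1])

pandas_ewma = x.ewm(span=span, adjust=False).mean()
print(np.allclose(manual, pandas_ewma.to_numpy()))
```

A smaller span means a larger alpha, so the output follows the raw signal more closely and smooths less.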

Savitzky-Golay Filter

For signals where you need to preserve peak shapes — not just smooth them away — the Savitzky-Golay filter fits a polynomial over a sliding window and is better at maintaining the height of genuine spikes.

from scipy.signal import savgol_filter

temp_savgol = pd.Series(
    savgol_filter(temp_noisy.values, window_length=11, polyorder=2),
    index=temp_noisy.index,
    name="temp_savgol"
).round(3)

print(pd.DataFrame({
    "raw":    temp_noisy,
    "savgol": temp_savgol,
}).iloc[22:30])

Output:

                          raw  savgol
2024-06-01 22:00:00  3.212372   2.960
2024-06-01 23:00:00  3.106840   2.944
2024-06-02 00:00:00  3.712290   3.114
2024-06-02 01:00:00  3.344376   3.379
2024-06-02 02:00:00  2.148946   3.809
2024-06-02 03:00:00  4.241105   4.288
2024-06-02 04:00:00  5.677429   4.749
2024-06-02 05:00:00  5.400083   5.138

Schema and Sanity Validation

Cleaning without validation is incomplete. You need automated checks that run every time new data arrives — catching problems before they silently corrupt downstream models.

from pandas.tseries.frequencies import to_offset

def validate_time_series(series: pd.Series, config: dict) -> dict:
    """
    Run schema and sanity checks on a time series.
    Returns a report dict with pass/fail per check.
    """
    report = {}

    # Frequency check: compare offsets rather than alias strings, because
    # pandas >= 2.2 reports hourly data as "h" where older versions used "H"
    inferred = pd.infer_freq(series.index)
    report["freq_regular"] = (
        inferred is not None
        and to_offset(inferred) == to_offset(config["expected_freq"])
    )

    # Missing value threshold
    missing_rate = series.isna().mean()
    report["missing_below_threshold"] = missing_rate <= config["max_missing_rate"]
    report["missing_rate"] = round(missing_rate, 4)

    # Value range check
    in_range = series.dropna().between(config["min_value"], config["max_value"])
    report["values_in_range"] = in_range.all()
    report["out_of_range_count"] = (~in_range).sum()

    # Duplicate timestamps
    report["no_duplicates"] = not series.index.duplicated().any()

    # Monotonic index
    report["index_monotonic"] = series.index.is_monotonic_increasing

    return report

config = {
    "expected_freq":    "h",
    "max_missing_rate": 0.05,
    "min_value":        210.0,
    "max_value":        250.0,
}

report = validate_time_series(voltage_outlier_fixed, config)

print("=== VALIDATION REPORT ===")
for check, result in report.items():
    if check in ("missing_rate", "out_of_range_count"):
        print(f"  {check}: {result}")
    else:
        status = "✓ PASS" if result else "✗ FAIL"
        print(f"  {status}  {check}")

Output:

=== VALIDATION REPORT ===
  ✓ PASS  freq_regular
  ✓ PASS  missing_below_threshold
  missing_rate: 0.0
  ✓ PASS  values_in_range
  out_of_range_count: 0
  ✓ PASS  no_duplicates
  ✓ PASS  index_monotonic

This validator is the kind of function you wrap around every data ingestion step in a production pipeline. Run it before cleaning to know what's broken, and after cleaning to confirm everything passed.
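To use a report like this as a gate in a pipeline, the boolean checks need to be turned into a hard stop. A minimal sketch follows; the helper names `failed_checks` and `require_valid` and the sample report are hypothetical, and the report is assumed to have the same shape as the one returned above.

```python
def failed_checks(report: dict) -> list:
    """Names of boolean checks that did not pass; numeric fields are ignored."""
    informational = {"missing_rate", "out_of_range_count"}
    return [name for name, ok in report.items()
            if name not in informational and not bool(ok)]

def require_valid(report: dict) -> None:
    """Raise if any check failed, so a pipeline step stops early."""
    failed = failed_checks(report)
    if failed:
        raise ValueError(f"Validation failed: {', '.join(failed)}")

# Hypothetical report in the same shape as the validator's output
sample_report = {
    "freq_regular": True,
    "missing_below_threshold": False,
    "missing_rate": 0.08,
    "values_in_range": True,
    "out_of_range_count": 0,
    "no_duplicates": True,
    "index_monotonic": True,
}

print(failed_checks(sample_report))
```

Calling `require_valid(sample_report)` would raise a ValueError naming the failed check, which is usually preferable to letting a bad batch flow silently into feature engineering.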

The Complete Cleaning Checklist

Here's the full sequence to run on any incoming time series dataset:

Step	Technique	When to Use
Audit	Index check, missing map, value range	Always — before anything else
Reindex	`reindex` to canonical frequency	When timestamps are absent rather than NaN
Missing: short gaps	Time interpolation	Continuous signals, gaps ≤ 3 steps
Missing: step signals	Forward fill	Categorical or setpoint data
Missing: long gaps	Seasonal decomposition impute	Seasonal signals, gaps > 6 steps
Outliers: univariate	Rolling Z-score or IQR	Single sensor, local anomalies
Outliers: multivariate	Isolation Forest	Multiple correlated sensors
Outlier treatment	Winsorize or interpolate	Depending on whether event is real
Duplicates	Keep first or group mean	Pipeline retry duplicates
Resampling	`.resample()` with correct aggregation	Frequency alignment before joins
Smoothing	EWMA or Savitzky-Golay	Noisy sensors before feature engineering
Validation	Schema + sanity checks	After cleaning, and on every new batch

Wrapping Up

The order matters. Reindex before imputing. Impute before smoothing. Validate after everything. Skipping steps or doing them out of order compounds errors in ways that are very difficult to trace back once you're looking at model predictions.
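The ordering can be made explicit by putting the steps in one function. This is a deliberately small sketch under stated assumptions: the function name `clean_series` is hypothetical, rows are assumed to arrive in delivery order, and outlier handling and smoothing are left out for brevity.

```python
import numpy as np
import pandas as pd

def clean_series(raw: pd.Series, freq: str = "h") -> pd.Series:
    """Minimal ordered pipeline: dedup -> sort -> reindex -> impute."""
    # 1. Deduplicate first, while rows are still in arrival order
    s = raw[~raw.index.duplicated(keep="first")].sort_index()
    # 2. Reindex so absent timestamps become explicit NaN rows
    grid = pd.date_range(s.index.min(), s.index.max(), freq=freq)
    s = s.reindex(grid)
    # 3. Only now impute, since the gaps are visible
    return s.interpolate(method="time")

# Hypothetical feed: one retry duplicate, one out-of-order row, one absent hour
ts = pd.to_datetime([
    "2024-06-01 00:00", "2024-06-01 01:00", "2024-06-01 01:00",
    "2024-06-01 04:00", "2024-06-01 03:00",
])
raw = pd.Series([10.0, 11.0, 11.4, 14.0, 13.0], index=ts)

cleaned = clean_series(raw)
print(cleaned)
```

If the interpolation ran before the reindex, the absent 02:00 row would never be filled, because it would not exist yet.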

Time series cleaning isn't glamorous work, but a model trained on clean data and thoughtfully engineered features will almost always outperform a more sophisticated model trained on data that wasn't cleaned properly. Getting this pipeline right is the highest-leverage thing you can do before you try running even the simplest algorithm on your time series data.
