Data Cleaning and Verification Script


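This script compares each pair of adjacent columns (A, B) in a CSV file and, wherever a cell in column A is empty, fills it with the value from column B.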

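First, it checks that the CSV file has an odd number of columns (the first column is the primary-key column and takes no part in the data operations, so the remaining columns must form A/B pairs); if not, it raises an exception.
It extracts the headers of the primary-key column, set A, and set B, prints how the A and B columns correspond, and asks the user whether to continue.
If the user confirms, it reads the data rows for set A and set B (skipping the first column).
For each pair of corresponding columns, it finds the cells that are empty in set A but have data in set B, and copies the set B value into set A.
Finally, it opens a new CSV file in write mode and writes the primary-key column and the processed set A data (including headers) to it.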
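
The steps above can be illustrated on a small in-memory table. This is only a sketch: the sample headers and values are made up, and `fill_from_b` is a hypothetical helper that mirrors the pairing and fill logic without the file I/O or the confirmation prompt.

```python
import csv
import io

# Hypothetical sample data: one key column followed by (A, B) column pairs.
SAMPLE = """id,name_a,name_b,city_a,city_b
1,Alice,Alice,,Paris
2,,Bob,Rome,Rome
3,Carol,,,
"""


def fill_from_b(rows):
    """Return (header, data) with the key and A columns; empty A cells take the B value."""
    header, *data = rows
    if len(header) % 2 != 1:
        raise ValueError("expected an odd number of columns")
    out_header = [header[0]] + header[1::2]
    out_rows = []
    for row in data:
        a_vals = row[1::2]  # columns 1, 3, 5, ... (set A)
        b_vals = row[2::2]  # columns 2, 4, 6, ... (set B)
        filled = [a if a else b for a, b in zip(a_vals, b_vals)]
        out_rows.append([row[0]] + filled)
    return out_header, out_rows


rows = list(csv.reader(io.StringIO(SAMPLE)))
header, filled_rows = fill_from_b(rows)
print(header)
for r in filled_rows:
    print(r)
```

In this sample, the empty `city_a` cell of row 1 and the empty `name_a` cell of row 2 are filled from their B counterparts, while row 3 keeps an empty `city_a` because its B cell is empty too.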

import csv


def process_csv(input_file):
    """Fill empty set A cells from set B and write the result to new_output.csv."""
    with open(input_file, 'r', encoding='utf-8', newline='') as csvfile:
        reader = csv.reader(csvfile)
        headers = next(reader)
        num_cols = len(headers)

        # One key column plus complete (A, B) pairs means an odd column count
        if num_cols % 2 != 1:
            raise ValueError("The total number of columns in the CSV should be odd, as the first column is the primary key column and not included in the data operations.")

        # Column 0 is the key; the remaining columns alternate A, B, A, B, ...
        primary_key_column = headers[0]
        set_a_cols = headers[1::2]
        set_b_cols = headers[2::2]

        print(f"Primary key column: {primary_key_column}")
        print("Correspondence between columns:")
        for a_header, b_header in zip(set_a_cols, set_b_cols):
            print(f"{a_header} corresponds to {b_header}")

        confirmation = input("Do you want to continue? (yes/no): ")
        # Tolerate surrounding whitespace and any letter case in the answer
        if confirmation.strip().lower() != "yes":
            print("Operation aborted.")
            return

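        data_a = []
        data_b = []
        original_first_column_data = []

        for row in reader:
            if not row:
                continue  # skip blank lines, which csv.reader yields as []
            # Pad short rows so every A cell has a matching B cell
            row = row + [''] * (num_cols - len(row))
            original_first_column_data.append(row[0])
            data_a.append(row[1:num_cols:2])
            data_b.append(row[2:num_cols:2])

        # Fill each empty A cell from the corresponding B cell
        for a_row, b_row in zip(data_a, data_b):
            for j, b_value in enumerate(b_row):
                if not a_row[j] and b_value:
                    a_row[j] = b_value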

        output_file = "new_output.csv"
        with open(output_file, 'w', encoding='utf-8', newline='') as outfile:
            writer = csv.writer(outfile)
            writer.writerow([primary_key_column] + set_a_cols)

            for i in range(len(data_a)):
                writer.writerow([original_first_column_data[i]] + data_a[i])


if __name__ == "__main__":
    input_csv_file = "opt.csv"
    process_csv(input_csv_file)
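
The script above fills gaps, but it does not report cells where A and B both hold values that disagree. If that kind of consistency check is wanted, one possible sketch follows. `find_mismatches` is a hypothetical helper, not part of the script, and treating "both non-empty and different after stripping whitespace" as a mismatch is an assumption.

```python
import csv
import io


def find_mismatches(rows):
    """Yield (key, a_header, a_value, b_value) where both cells are non-empty and differ."""
    header, *data = rows
    a_headers = header[1::2]
    for row in data:
        # zip stops at the shortest sequence, so unpaired trailing cells are ignored
        for a_name, a, b in zip(a_headers, row[1::2], row[2::2]):
            if a.strip() and b.strip() and a.strip() != b.strip():
                yield row[0], a_name, a, b


# Made-up sample: only row 1 has two non-empty values that disagree.
sample = "id,name_a,name_b\n1,Alice,Alicia\n2,Bob,Bob\n3,,Carol\n"
mismatches = list(find_mismatches(csv.reader(io.StringIO(sample))))
for key, column, a_value, b_value in mismatches:
    print(f"Row {key}, column {column}: A={a_value!r} B={b_value!r}")
```

Running such a check before the fill step would show which rows need manual review, since the fill itself never overwrites a non-empty A cell.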
