DuckDB Test Failure: Column Count Mismatch In SQLLogicTest

Aug 15, 2025 by Luna Greco

Python SQLLogicTest Failure: Wrong Column Count in DuckDB

Hey everyone,

We've got a bit of a situation in the DuckDB testing realm. The Python SQLLogicTest Library job (on Linux) is failing with a "Wrong column count in query!" error. The failure is in test/sql/copy/row_groups_per_file.test, at line 58. Let's dive into the details and see what's up.

Understanding the Error

The core issue is that the test expects 6 columns in the query result, but it's only getting 1. This mismatch is causing the test to fail. The error message pinpoints the exact location of the problem: /home/runner/work/duckdb/duckdb/test/sql/copy/row_groups_per_file.test:58. This helps us narrow down the scope and focus our debugging efforts.

The Culprit: COPY Command with Parquet Format

The failing query involves a COPY command, which is used to export data from DuckDB to a file. In this case, the data is being exported to Parquet format with specific configurations:

COPY bigdata TO '/tmp/pytest-of-runner/pytest-0/test_sqllogic_test_sql_copy_ro0/row_groups_per_file_stats/' (
    FORMAT PARQUET,
    WRITE_EMPTY_FILE false,
    FILENAME_PATTERN '{uuid}',
    ROW_GROUP_SIZE 3000,
    ROW_GROUPS_PER_FILE 2,
    RETURN_STATS
);

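This command exports the bigdata table to Parquet files in the target directory. ROW_GROUP_SIZE and ROW_GROUPS_PER_FILE control how rows are grouped and when a new file is started, and FILENAME_PATTERN '{uuid}' gives each file a UUID-based name. The RETURN_STATS option is the one that matters here: it makes the COPY statement return statistics about the files it wrote, so the statement produces a result set that the test then has to verify.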

Expected vs. Actual Results

The query returns one row per generated Parquet file. In the failing run, the Python client produced the rows below, each a tuple of six values: the file path, three integers, a per-column statistics map, and a trailing None.

('/tmp/pytest-of-runner/pytest-0/test_sqllogic_test_sql_copy_ro0/row_groups_per_file_stats//a5b4e361-00bb-406a-a54b-374e93e0091f.parquet', 8192, 67691, 465, {'"col_a"': {'column_size_bytes': '33607', 'max': '8191', 'min': '0', 'null_count': '0'}, '"col_b"': {'column_size_bytes': '33607', 'max': '8191', 'min': '0', 'null_count': '0'}}, None)
('/tmp/pytest-of-runner/pytest-0/test_sqllogic_test_sql_copy_ro0/row_groups_per_file_stats//d5b411cb-9db6-4f02-89fb-93d73b6172ff.parquet', 1808, 14847, 281, {'"col_a"': {'column_size_bytes': '7277', 'max': '9999', 'min': '8192', 'null_count': '0'}, '"col_b"': {'column_size_bytes': '7277', 'max': '9999', 'min': '8192', 'null_count': '0'}}, None)

The expected result in the test file looks quite different. File names and sizes vary between runs, so each expected value is written as a pattern with a <REGEX>: prefix rather than as a literal. The six values of a row are meant to be separated by tab characters:

<REGEX>:.*row_groups_per_file_stats.*[a-zA-Z0-9-]{36}.parquet <REGEX>:\d+ <REGEX>:\d+ <REGEX>:\d+ <REGEX>:{'"col_a"'={column_size_bytes=\d+, max=\d+, min=\d+, null_count=0}, '"col_b"'={column_size_bytes=\d+, max=\d+, min=\d+, null_count=0}} NULL
<REGEX>:.*row_groups_per_file_stats.*[a-zA-Z0-9-]{36}.parquet <REGEX>:\d+ <REGEX>:\d+ <REGEX>:\d+ <REGEX>:{'"col_a"'={column_size_bytes=\d+, max=\d+, min=\d+, null_count=0}, '"col_b"'={column_size_bytes=\d+, max=\d+, min=\d+, null_count=0}} NULL

The error message "Error in test! Column count mismatch after splitting on tab on row 1!" tells us how the runner got here: it split a result row on tab characters, and the number of fields it got back did not match the number of columns it was looking for. The mismatch already shows up on the first row.
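To make the "splitting on tab" message concrete, here is a minimal Python sketch of the kind of check it describes. This is an illustration only, not the runner's actual code: the function name split_result_row and the sample values are hypothetical.

```python
def split_result_row(line: str, expected_columns: int) -> list[str]:
    """Split one result row on tabs and verify the field count.

    Hypothetical helper for illustration; not DuckDB's implementation.
    """
    fields = line.split("\t")
    if len(fields) != expected_columns:
        raise ValueError(
            "Column count mismatch after splitting on tab: "
            f"expected {expected_columns}, got {len(fields)}"
        )
    return fields


# Six values joined by real tab characters split cleanly into six fields.
tabbed = "\t".join(["file.parquet", "8192", "67691", "465", "{stats}", "NULL"])
print(len(split_result_row(tabbed, 6)))

# The same values joined by spaces collapse into one field and fail the check.
spaced = tabbed.replace("\t", " ")
try:
    split_result_row(spaced, 6)
except ValueError as err:
    print(err)
```

The second call shows how a row whose separators are spaces, not tabs, ends up as a single field even though it looks like six values on screen.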

Investigating the Root Cause

So, what could be causing this discrepancy? Here are a few potential areas to investigate:

Parquet Statistics Output: The RETURN_STATS option in the COPY command is intended to return statistics about the Parquet files. It's possible that there's an issue with how these statistics are being formatted or returned. Perhaps the number of columns returned by the RETURN_STATS option has changed, and the test hasn't been updated to reflect this.
Tab Delimiter Issue: The runner splits each result row on tab characters and compares the number of fields with the number of columns the test declares. If the values in a row are separated by spaces instead of tabs, or a value contains an embedded tab, the split produces the wrong count. A row whose separators are all spaces collapses into a single field, which would fit a report of 1 column where 6 are expected.
Regular Expression Mismatch: The values prefixed with <REGEX>: are matched as patterns rather than compared as exact strings. It's possible that a pattern does not match the way the Python client renders a value. The statistics map is a good candidate: in one listing it appears as a Python dict ({'"col_a"': {'column_size_bytes': '33607', ...}}), while the pattern is written for the {key=value} text form. It's also possible that the output format has changed slightly, causing the match to fail.
Environment Differences: The failure is occurring in the NightlyTests workflow on Linux. It's worth considering whether there might be environment-specific differences that are affecting the behavior of the COPY command or the output formatting. Different versions of libraries or tools could potentially lead to variations in the results.
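The pattern-matching idea from the list above can be sketched in a few lines of Python. Treat this as an assumption-laden illustration: the helper value_matches is hypothetical, it assumes a <REGEX>: field must match the whole value, and it uses Python's re module, whose syntax and matching rules may differ from what the real runner uses.

```python
import re

REGEX_PREFIX = "<REGEX>:"


def value_matches(expected: str, actual: str) -> bool:
    """Compare one expected field with one actual value.

    Assumption: a field starting with "<REGEX>:" must match the whole
    actual value; any other field is compared as an exact string.
    """
    if expected.startswith(REGEX_PREFIX):
        pattern = expected[len(REGEX_PREFIX):]
        return re.fullmatch(pattern, actual) is not None
    return expected == actual


# Path taken from the failing run quoted in this post.
path = (
    "/tmp/pytest-of-runner/pytest-0/test_sqllogic_test_sql_copy_ro0/"
    "row_groups_per_file_stats//a5b4e361-00bb-406a-a54b-374e93e0091f.parquet"
)
path_pattern = r"<REGEX>:.*row_groups_per_file_stats.*[a-zA-Z0-9-]{36}.parquet"

print(value_matches(path_pattern, path))
print(value_matches(r"<REGEX>:\d+", "8192"))
print(value_matches("NULL", "NULL"))
```

Checking fields one at a time like this makes it easier to see whether a failure comes from a pattern that no longer matches or from the row never being split into fields in the first place.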

Steps to Resolve the Issue

To get this test passing again, we need to dig deeper and identify the root cause. Here's a plan of action:

Examine the DuckDB Code: We should review the code related to the COPY command, particularly the parts that handle Parquet output and the RETURN_STATS option. This will help us understand how the statistics are being generated and formatted.
Inspect the Test Code: We need to carefully examine the test code in test/sql/copy/row_groups_per_file.test to understand how it's parsing the output and what it expects the format to be. We should pay close attention to the regular expressions and the column splitting logic.
Reproduce the Issue Locally: The best way to debug this is to reproduce the failure locally. This will allow us to run the test in a controlled environment and use debugging tools to inspect the output and the code execution.
Update the Test or the Code: Once we've identified the root cause, we'll need to either update the test to match the new output format or fix the code to generate the expected output. It depends on whether the behavior change is intentional or a bug.
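When inspecting the test file, one quick check is whether every expected row really splits into the declared number of tab-separated fields. The sketch below is a hypothetical helper, not part of DuckDB's tooling. It assumes the common sqllogictest layout: a "query" header with one type character per column, the SQL, a "----" separator, then expected rows up to the next blank line. Blocks that list one value per line would also be flagged, so read its output as a hint, not a verdict.

```python
def find_tab_mismatches(test_text: str) -> list[tuple[int, int, int]]:
    """Return (line_number, declared_columns, found_fields) for suspect rows.

    Hypothetical diagnostic; assumes row-wise, tab-separated expected results.
    """
    mismatches = []
    declared = None      # column count taken from the current "query" header
    in_results = False   # True between "----" and the next blank line
    for number, line in enumerate(test_text.splitlines(), start=1):
        if in_results:
            if not line.strip():
                # A blank line ends the expected-result block.
                in_results = False
                declared = None
                continue
            found = len(line.split("\t"))
            if found != declared:
                mismatches.append((number, declared, found))
        elif line.startswith("query "):
            parts = line.split()
            # One type character per column, e.g. "IIIIII" declares six.
            declared = len(parts[1]) if len(parts) > 1 else None
        elif line.strip() == "----" and declared is not None:
            in_results = True
        elif not line.strip():
            declared = None
    return mismatches


sample = (
    "query II\n"
    "SELECT 1, 2\n"
    "----\n"
    "1\t2\n"
    "\n"
    "query II\n"
    "SELECT 3, 4\n"
    "----\n"
    "3 4\n"
)
print(find_tab_mismatches(sample))  # [(9, 2, 1)]
```

To run it against the real file, read test/sql/copy/row_groups_per_file.test into a string (for example with pathlib.Path.read_text) and pass that string to the function.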

Workflow and Branch Information

For those wanting to dive into the details, here's some relevant information about the failed workflows:

Workflow: NightlyTests (16980034664)
- Failed job: Python SQLLogicTest Library (Linux)
- Branch: main (421152158231ac5fe2d04e9958cedf06cb4aee64)
Workflow: NightlyTests (16980034662)
- Failed job: Python SQLLogicTest Library (Linux)
- Branch: v1.3-ossivalis (17c093c0bbafaeeec6d614aad298b7357cd10f39)

This information can help you track down the specific commits and code versions that are involved in the failure.

Conclusion

The "Wrong column count in query!" error in the DuckDB Python SQLLogicTest is a tricky one, but by systematically investigating the code, the test, and the environment, we can get to the bottom of it. Let's roll up our sleeves and get this fixed, guys!