
Weird bug in TabularDataset.column_names #589

@Ark-kun

Description

I ran into a very weird issue.
I imported the well-known Retail Stockout prediction dataset, which is in CSV format, into Vertex AI Datasets using the google.cloud.aiplatform.TabularDataset class from the Python library.

Most columns have the "Wk_" prefix. The screenshot shows that there is only one column with "2016_43_Quantity" in its name: the "Wk_2016_43_Quantity" column, just like in the source CSV.
So far everything is fine.

But here is the problem:
When I call the API to get the dataset metadata, including the column names, all column names are fine except one, which is reported as "WWk_2016_43_Quantity". (Notice the double "W" in the "WWk_" prefix.)
In context:
...
'Wk_2016_42_Quantity',
'WWk_2016_43_Quantity',
'Wk_2016_44_Quantity',
...

This discrepancy causes the subsequent model training to fail because the dataset does not have a WWk_2016_43_Quantity column (it has Wk_2016_43_Quantity instead).
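A check like the following could catch this kind of mismatch before training starts. It is only a minimal sketch: `diff_column_names` is a hypothetical helper, not part of google-cloud-aiplatform, and the two lists contain just the three names from the excerpt above rather than the full set of columns.

```python
def diff_column_names(expected, reported):
    """Return (missing, unexpected) column names, preserving input order."""
    expected_set = set(expected)
    reported_set = set(reported)
    # Names in the CSV header that the library did not report.
    missing = [name for name in expected if name not in reported_set]
    # Names the library reported that are not in the CSV header.
    unexpected = [name for name in reported if name not in expected_set]
    return missing, unexpected


# Names as they appear in the source CSV (excerpt only).
expected = ["Wk_2016_42_Quantity", "Wk_2016_43_Quantity", "Wk_2016_44_Quantity"]
# Names as returned by TabularDataset.column_names (excerpt only).
reported = ["Wk_2016_42_Quantity", "WWk_2016_43_Quantity", "Wk_2016_44_Quantity"]

missing, unexpected = diff_column_names(expected, reported)
print("missing:", missing)
print("unexpected:", unexpected)
```

In practice, `reported` would come from the `column_names` property and `expected` from the header of the source file.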

I do not understand how this could have happened, but you can easily examine the imported dataset and see that what the UI shows and what the google-cloud-aiplatform library returns differ.

Environment details

  • OS type and version: Linux
  • Python version: 3.7
  • google-cloud-aiplatform version: 1.1.1

Steps to reproduce

  1. Create dataset from the "gs://kubeflow-pipelines-regional-us-central1/mirror/cloud-ml-data/automl-tables/notebooks/stockout.csv" file
  2. Read its column_names property

Code example

from google.cloud import aiplatform
print(aiplatform.TabularDataset('projects/140626129697/locations/us-central1/datasets/2405036550225133568').column_names)
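For comparison, the header can be read independently of the library with Python's standard csv module. The sketch below assumes a comma-delimited file whose first row is the header. `read_csv_header` is a hypothetical helper, and the in-memory sample holds only the three column names quoted in this issue. With a local copy of stockout.csv, an opened file object would take the place of the `io.StringIO` sample.

```python
import csv
import io


def read_csv_header(stream):
    """Return the first row of a CSV text stream as a list of column names."""
    reader = csv.reader(stream)
    # next() with a default avoids StopIteration on an empty stream.
    return next(reader, [])


# Illustrative header-only sample, not the real file contents.
sample = io.StringIO(
    "Wk_2016_42_Quantity,Wk_2016_43_Quantity,Wk_2016_44_Quantity\n"
)
header = read_csv_header(sample)
print(header)
```

The resulting list can then be compared with the output of `column_names` to confirm which side introduces the extra "W".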

Labels

  • api: aiplatform (Issues related to the AI Platform API.)
  • triage me (I really want to be triaged.)
