I ran into a very weird issue.
I imported the well-known Retail Stockout prediction dataset, in CSV format, into Vertex AI Datasets using the google.cloud.aiplatform.TabularDataset class from the Python library.
Most columns have the "Wk_" prefix. The screenshot shows that only one column contains "2016_43_Quantity": the "Wk_2016_43_Quantity" column, just like in the source CSV.
So far, everything is fine.
But here is the problem:
When I call the API to get the dataset metadata, including the column names, every column name is correct except one, which is returned as "WWk_2016_43_Quantity" (notice the double "W" in the "WWk_" prefix).
In context:
...
'Wk_2016_42_Quantity',
'WWk_2016_43_Quantity',
'Wk_2016_44_Quantity',
...
This discrepancy causes the subsequent model training to fail, because the dataset has no WWk_2016_43_Quantity column (it has Wk_2016_43_Quantity instead).
I do not understand how this could have happened, but you can easily examine the imported dataset and see that what the UI shows differs from what the google-cloud-aiplatform library returns.
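To make the mismatch easy to spot, here is a minimal sketch of a helper that compares the column names expected from the CSV header with the names returned by the library. The function name diff_column_names is hypothetical (it is not part of google-cloud-aiplatform), and the two lists only contain the three names quoted in the excerpt above, not the full set of columns.

```python
from typing import Iterable, List, Tuple


def diff_column_names(
    expected: Iterable[str], actual: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) column names, preserving input order.

    missing:    names present in `expected` but absent from `actual`
    unexpected: names present in `actual` but absent from `expected`
    """
    expected_list = list(expected)
    actual_list = list(actual)
    expected_set = set(expected_list)
    actual_set = set(actual_list)
    missing = [name for name in expected_list if name not in actual_set]
    unexpected = [name for name in actual_list if name not in expected_set]
    return missing, unexpected


# Only the three names quoted in this report; the real dataset has more columns.
csv_header = ["Wk_2016_42_Quantity", "Wk_2016_43_Quantity", "Wk_2016_44_Quantity"]
api_columns = ["Wk_2016_42_Quantity", "WWk_2016_43_Quantity", "Wk_2016_44_Quantity"]

missing, unexpected = diff_column_names(csv_header, api_columns)
print("missing from API result:", missing)
print("unexpected in API result:", unexpected)
```

In practice, the second argument would be the list returned by the column_names property of the dataset.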
Environment details
- OS type and version: Linux
- Python version: 3.7
- google-cloud-aiplatform version: 1.1.1
Steps to reproduce
- Create dataset from the "gs://kubeflow-pipelines-regional-us-central1/mirror/cloud-ml-data/automl-tables/notebooks/stockout.csv" file
- Try getting its columns
Code example
from google.cloud import aiplatform
print(aiplatform.TabularDataset('projects/140626129697/locations/us-central1/datasets/2405036550225133568').column_names)
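For comparison with the output of the snippet above, the header row can be read straight from the source file with the standard csv module. This is only a sketch: read_csv_header is a hypothetical helper, and the in-memory sample stands in for a local copy of stockout.csv (downloading the file from the bucket is assumed to have been done separately).

```python
import csv
import io
from typing import List, TextIO


def read_csv_header(stream: TextIO) -> List[str]:
    """Return the first row of a CSV text stream as a list of column names."""
    return next(csv.reader(stream))


# Stand-in for the real file. With a local copy, one would instead use:
#   with open("stockout.csv", newline="") as f:
#       header = read_csv_header(f)
sample = io.StringIO(
    "Wk_2016_42_Quantity,Wk_2016_43_Quantity,Wk_2016_44_Quantity\n1,2,3\n"
)
header = read_csv_header(sample)
print(header)
```

Any name that appears in the library output but not in this header (such as the "WWk_" one) points at the import or metadata step rather than at the source CSV.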