feat: Added tabular data support from BQ and GCS during dataset creation #40

Merged: ivanmkc merged 3 commits into googleapis:dev from ivanmkc:imkc--bq-import on Nov 6, 2020

Conversation

@ivanmkc ivanmkc (Contributor) commented Nov 4, 2020

Description

This PR adds BQ import support for tabular datasets during Dataset creation.

Fixes https://b.corp.google.com/issues/171311614

Testing:

  1. Confirmed that a dataset can be created with BQ data imported:

     my_dataset = Dataset.create(
         display_name=_TEST_DISPLAY_NAME,
         metadata_schema_uri=schema.dataset.metadata.tabular,
         bq_source="bq://bigquery-public-data.austin_311.311_request",
     )

Validation

I've decided to hold off on validation until I have a clearer picture of what to catch at dataset creation versus when to let the underlying API raise the error. I'll do this in a later PR. This includes error cases like:

  • 1.1.1 Creating a tabular dataset without specifying a GCS/BQ datasource
  • 1.1.2 Creating a tabular dataset with a GCS datasource that is in an invalid format (such as malformed CSV)
  • 1.1.3 Creating a tabular dataset with a non-existent BQ datasource
  • 1.1.4 Creating a tabular dataset with a GCS datasource while also providing an import format
  • 1.1.5 Creating a non-tabular dataset while setting a BQ datasource

It's likely that I'll have to add this validation: everything appears to work at the Dataset creation step but can potentially break downstream.

Details listed here: https://docs.google.com/document/d/1eyNM8GGf-JEtsZRLCm5SVXQXddvuQeODSEEsKv-mHJ4/edit?usp=sharing

@ivanmkc ivanmkc requested a review from a team November 4, 2020 02:25
@google-cla google-cla Bot added the cla: yes This human has signed the Contributor License Agreement. label Nov 4, 2020
@ivanmkc ivanmkc changed the base branch from master to dev November 4, 2020 02:25
  my_dataset = Dataset.create(
      display_name=_TEST_DISPLAY_NAME,
-     source=_TEST_SOURCE_URI,
+     gcs_source=_TEST_SOURCE_URI_GCS,
ivanmkc (Contributor, Author):
Renaming for clarity


@pytest.mark.usefixtures("get_dataset_mock")
- def test_create_dataset(self, create_dataset_mock):
+ def test_create_dataset_nontabular(self, create_dataset_mock):
ivanmkc (Contributor, Author):

Non-tabular dataset doesn't need metadata set for it.

Comment thread tests/unit/aiplatform/test_datasets.py Outdated
parent=_TEST_PARENT, dataset=expected_dataset, metadata=()
)

# TODO: test_create_dataset_nontabular_with_bq_source <- should raise a value error
ivanmkc (Contributor, Author):

TODO later in this PR

"gs://my-bucket/index_file_2.jsonl",
"gs://my-bucket/index_file_3.jsonl",
]
_TEST_SOURCE_URI_BQ = "bigquery://my-project/my-dataset"
ivanmkc (Contributor, Author):

Separated BQ and GCS sources for testing

Comment thread google/cloud/aiplatform/datasets.py Outdated
An object representing a long-running operation.
"""
# TODO(b/171311614): Add support for BiqQuery import source
# Should throw error if BQ source is received
ivanmkc (Contributor, Author):

TODO in this PR

Comment thread google/cloud/aiplatform/datasets.py Outdated
)

- def _import(
+ def _import_gcs(
ivanmkc (Contributor, Author):

Made explicit that this only works with GCS

Member:

_import_from_gcs

Comment thread google/cloud/aiplatform/datasets.py Outdated
labels=labels,
)

if dataset_metadata is None:
ivanmkc (Contributor, Author):

Was used for testing, will remove.

parent: str,
display_name: str,
metadata_schema_uri: str,
dataset_metadata: Dict,
ivanmkc (Contributor, Author):

Made explicitly non-optional as it's required in GapicDataset.

Member:

Is it possible dataset_metadata can have a stricter type signature?

Dict[str, Any] or Dict[str, Dict]?

ivanmkc (Contributor, Author):

@sasha-gitg I'm checking the proto for more guidance on the exact data structure but it just says it's a struct:

https://source.corp.google.com/piper///depot/google3/google/cloud/aiplatform/master/dataset.proto;l=69


  # If an import source was not provided, return empty created Dataset.
- if not source:
+ if gcs_source and not is_tabular_dataset_metadata:
ivanmkc (Contributor, Author):

Tabular datasets must be imported at creation time.

Comment thread google/cloud/aiplatform/datasets.py Outdated
api_client = cls._instantiate_client(location=location, credentials=credentials)

# If this is tabular enrich the dataset metadata with source
# TODO: Use interfaces to abstract away gcs and bq specific logic
ivanmkc (Contributor, Author):

May experiment with this in a later PR.

that can be used here are found in gs://google-cloud-
aiplatform/schema/dataset/metadata/.
- source: Optional[Sequence[str]]=None:
+ gcs_source: Optional[Sequence[str]]=None:
ivanmkc (Contributor, Author) commented Nov 4, 2020:

gcs_source seems to be an array of URIs while bq_source seems to be a single URI.

Combining them into one parameter seems messy (even if both were the same datatype), so I split them. By messy, I mean it would be unclear what datatype to use, and there would have to be logic to validate BQ versus GCS types. It seems cleaner to use the type system to differentiate them. I can combine them into a single source parameter again once I refactor it to use a single interface.

ivanmkc (Contributor, Author):

Also, there's specific logic that changes depending on bq_source and gcs_source, which I feel should really be abstracted out to decouple Dataset from GCS and BQ, as well as to allow future extensibility. I'll put a PR together later for discussion.

Comment thread google/cloud/aiplatform/datasets.py Outdated
information on wildcards, see
https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames.
bq_source: Optional[str]=None:
BigQuery URI to the input dataset.
Reviewer:

What's meant by URI? Could you provide an example?

Also, does "dataset" refer to a BigQuery dataset or table?

ivanmkc (Contributor, Author) commented Nov 4, 2020:

It should be a BigQuery table, let me make it more explicit.

It's one of these, what's our term for this?

bq://bigquery-public-data.austin_311.table123

Reviewer:

Hrm. I've never seen a BQ URI like that before.

@shollyman Does this look familiar to you? Are there other places we use the bq:// pseudo-protocol?

ivanmkc (Contributor, Author) commented Nov 5, 2020:

Spoke with the MBSDK team and they said the format I provided is quite common. Have you seen another version?

Reviewer:

BQ itself doesn't use this prefix format anywhere I'm aware of. Storage resources have multiple formats: TableReference messages, OP-style string names (projects/p/datasets/d/tables/t), and SQL identifiers (p.d.t), depending on where they're consumed.

I can't find any reference to it in BQ code, but there's lots of references from aiplatform things, and related things in the ML space (kaggle).
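For illustration, stripping the bq:// pseudo-protocol yields the SQL-identifier form mentioned above; a sketch (the function name is hypothetical, and it assumes the project.dataset.table layout shown earlier in this thread):

```python
def bq_uri_to_sql_identifier(uri: str) -> str:
    """Convert a 'bq://project.dataset.table' URI to 'project.dataset.table'."""
    prefix = "bq://"
    if not uri.startswith(prefix):
        raise ValueError(f"Expected a {prefix} URI, got: {uri}")
    return uri[len(prefix):]
```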

@ivanmkc ivanmkc changed the title [WIP] BQ import support Added BQ import support during dataset creation Nov 4, 2020
@ivanmkc ivanmkc changed the title Added BQ import support during dataset creation Added tabular data support from BQ and GCS during dataset creation Nov 4, 2020
Comment thread google/cloud/aiplatform/datasets.py Outdated
"input_config": {"bigquery_source": {"uri": bq_source}}
}

# TODO: Remove this and let the error propagate up from downstream?
ivanmkc (Contributor, Author):

I'm about to remove this pending further discussion on validation. See the PR description. Thoughts?

Member:

I think it's safe to keep.

Comment thread google/cloud/aiplatform/datasets.py Outdated
is defined as an OpenAPI 3.0.2 Schema Object. The schema files
that can be used here are found in gs://google-cloud-
aiplatform/schema/dataset/metadata/.
TODO: Add dataset_metadata info
ivanmkc (Contributor, Author):

Do I just write this docstring or do I loop in a TW (technical writer)?

@ivanmkc ivanmkc force-pushed the imkc--bq-import branch 2 times, most recently from fd3e7eb to d3c18b0 Compare November 4, 2020 23:21
display_name=_TEST_DISPLAY_NAME,
metadata_schema_uri=_TEST_METADATA_SCHEMA_URI_NONTABULAR,
labels=_TEST_LABEL,
metadata={},
ivanmkc (Contributor, Author):

Note the lack of metadata here.

- metadata_schema_uri=_TEST_METADATA_SCHEMA_URI,
+ metadata_schema_uri=_TEST_METADATA_SCHEMA_URI_TABULAR,
labels=_TEST_LABEL,
metadata=_TEST_METADATA_TABULAR_BQ,
ivanmkc (Contributor, Author):

Note that creating a tabular dataset sets the metadata field.

sasha-gitg (Member) left a comment:

Minor comments. Looks good!

@@ -23,6 +23,7 @@
from google.cloud.aiplatform import base
Member:

View this conversation on imports: #42 (comment)

ivanmkc (Contributor, Author):

Thanks for the tip. Don't think there is anything to do here.

gapic_dataset = GapicDataset(
display_name=display_name,
metadata_schema_uri=metadata_schema_uri,
metadata=dataset_metadata,
Member:

@dizcology Why don't we have to use json_format.ParseDict here?

ivanmkc (Contributor, Author) commented Nov 5, 2020:

As to your other question, it seems to work fine without json_format.ParseDict.

ivanmkc (Contributor, Author):

By "works fine", I mean a dataset is created correctly in the cloud.
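A likely explanation, offered as an assumption rather than something verified against the generated client: proto-plus message wrappers coerce plain Python dicts into Struct-backed fields on assignment, so the explicit conversion is only needed with raw protobuf messages. With raw protobuf it would look like:

```python
# Explicit dict-to-Value conversion with raw protobuf messages; with
# proto-plus wrappers this step appears to be handled automatically.
from google.protobuf import json_format, struct_pb2

metadata_dict = {"input_config": {"bigquery_source": {"uri": "bq://p.d.t"}}}
value = json_format.ParseDict(metadata_dict, struct_pb2.Value())
# Struct supports dict-style access via the well-known-types mixin.
uri = value.struct_value["input_config"]["bigquery_source"]["uri"]
```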


@ivanmkc ivanmkc changed the title Added tabular data support from BQ and GCS during dataset creation feat: Added tabular data support from BQ and GCS during dataset creation Nov 6, 2020
@ivanmkc ivanmkc merged commit 5b098a4 into googleapis:dev Nov 6, 2020
dizcology pushed a commit to dizcology/python-aiplatform that referenced this pull request Nov 30, 2020
…ion (googleapis#40)

* feat: Add tabular import support from GCS and BQ

* fix: Ran linter again with following commands

pip3 install flake8 black==19.10b0
black docs google tests noxfile.py setup.py
flake8 google tests

* fix: Moved instantiation after validation and renamed a _import_gcs function

Co-authored-by: Ivan Cheung <ivanmkc@google.com>
dizcology pushed a commit to dizcology/python-aiplatform that referenced this pull request Dec 22, 2020
…ion (googleapis#40)

* feat: Add tabular import support from GCS and BQ

* fix: Ran linter again with following commands

pip3 install flake8 black==19.10b0
black docs google tests noxfile.py setup.py
flake8 google tests

* fix: Moved instantiation after validation and renamed a _import_gcs function

Co-authored-by: Ivan Cheung <ivanmkc@google.com>