feat: Added tabular data support from BQ and GCS during dataset creation #40

Merged: ivanmkc merged 3 commits into googleapis:dev from ivanmkc:imkc--bq-import on Nov 6, 2020

Conversation

@ivanmkc ivanmkc (Contributor) commented Nov 4, 2020

Description

This PR adds BQ import support for tabular datasets during Dataset creation.

Fixes https://b.corp.google.com/issues/171311614

Testing:

  1. Confirmed that a dataset can be created with BQ data imported:

     my_dataset = Dataset.create(
         display_name=_TEST_DISPLAY_NAME,
         metadata_schema_uri=schema.dataset.metadata.tabular,
         bq_source="bq://bigquery-public-data.austin_311.311_request",
     )

Validation

I've decided to hold off on validation until I have a clearer picture of what to catch at dataset creation versus when to let the underlying API raise the error. I'll do this in a later PR. This includes error cases like:

  • 1.1.1 Creating a tabular dataset without specifying a GCS/BQ datasource
  • 1.1.2 Creating a tabular dataset with a GCS datasource that is in an invalid format (such as malformed CSV)
  • 1.1.3 Creating a tabular dataset with a non-existent BQ datasource
  • 1.1.4 Creating a tabular dataset with a GCS datasource while also providing an import format
  • 1.1.5 Creating a non-tabular dataset while setting a BQ datasource

It's likely that I'll have to add this validation: everything appears to work at the Dataset creation step but can potentially break downstream.

Details listed here: https://docs.google.com/document/d/1eyNM8GGf-JEtsZRLCm5SVXQXddvuQeODSEEsKv-mHJ4/edit?usp=sharing

@ivanmkc ivanmkc requested a review from a team November 4, 2020 02:25
@google-cla google-cla Bot added the cla: yes This human has signed the Contributor License Agreement. label Nov 4, 2020
@ivanmkc ivanmkc changed the base branch from master to dev November 4, 2020 02:25
  my_dataset = Dataset.create(
      display_name=_TEST_DISPLAY_NAME,
-     source=_TEST_SOURCE_URI,
+     gcs_source=_TEST_SOURCE_URI_GCS,
ivanmkc (Contributor, Author):
Renaming for clarity


@pytest.mark.usefixtures("get_dataset_mock")
- def test_create_dataset(self, create_dataset_mock):
+ def test_create_dataset_nontabular(self, create_dataset_mock):
ivanmkc (Contributor, Author):

Non-tabular dataset doesn't need metadata set for it.

Comment thread tests/unit/aiplatform/test_datasets.py Outdated
parent=_TEST_PARENT, dataset=expected_dataset, metadata=()
)

# TODO: test_create_dataset_nontabular_with_bq_source <- should raise a value error
ivanmkc (Contributor, Author):

TODO later in this PR

"gs://my-bucket/index_file_2.jsonl",
"gs://my-bucket/index_file_3.jsonl",
]
_TEST_SOURCE_URI_BQ = "bigquery://my-project/my-dataset"
ivanmkc (Contributor, Author):

Separated BQ and GCS sources for testing

Comment thread google/cloud/aiplatform/datasets.py Outdated
An object representing a long-running operation.
"""
# TODO(b/171311614): Add support for BiqQuery import source
# Should throw error if BQ source is received
ivanmkc (Contributor, Author):

TODO in this PR

Comment thread google/cloud/aiplatform/datasets.py Outdated
)

- def _import(
+ def _import_gcs(
ivanmkc (Contributor, Author):

Made explicit that this only works with GCS

Member:

_import_from_gcs

Comment thread google/cloud/aiplatform/datasets.py Outdated
labels=labels,
)

if dataset_metadata is None:
ivanmkc (Contributor, Author):

Was used for testing, will remove.

parent: str,
display_name: str,
metadata_schema_uri: str,
dataset_metadata: Dict,
ivanmkc (Contributor, Author):

Made explicitly non-optional as it's required in GapicDataset.

Member:

Is it possible dataset_metadata can have a stricter type signature?

Dict[str, Any] or Dict[str, Dict]?

ivanmkc (Contributor, Author):

@sasha-gitg I'm checking the proto for more guidance on the exact data structure but it just says it's a struct:

https://source.corp.google.com/piper///depot/google3/google/cloud/aiplatform/master/dataset.proto;l=69


  # If an import source was not provided, return empty created Dataset.
- if not source:
+ if gcs_source and not is_tabular_dataset_metadata:
ivanmkc (Contributor, Author):

Tabular datasets must be imported at creation time.

Comment thread google/cloud/aiplatform/datasets.py Outdated
api_client = cls._instantiate_client(location=location, credentials=credentials)

# If this is tabular enrich the dataset metadata with source
# TODO: Use interfaces to abstract away gcs and bq specific logic
ivanmkc (Contributor, Author):

May experiment with this in a later PR.

that can be used here are found in gs://google-cloud-
aiplatform/schema/dataset/metadata/.
- source: Optional[Sequence[str]]=None:
+ gcs_source: Optional[Sequence[str]]=None:
ivanmkc (Contributor, Author) commented Nov 4, 2020:

gcs_source seems to be an array of URIs while bq_source seems to be a single URI.

Combining them into one parameter seems messy (even if both were the same datatype), so I split them. By messy, I mean it would be unclear what datatype to use, and there would have to be logic to validate BQ versus GCS types. It seems cleaner to use the type system to differentiate them. I can combine them into a single source parameter again once I refactor it to use a single interface.

ivanmkc (Contributor, Author):

Also, there's specific logic that changes depending on bq_source and gcs_source, which I feel should really be abstracted out to decouple Dataset from GCS and BQ, as well as to allow future extensibility. I'll put a PR together later for discussion.

Comment thread google/cloud/aiplatform/datasets.py Outdated
information on wildcards, see
https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames.
bq_source: Optional[str]=None:
BigQuery URI to the input dataset.
Reviewer:

What's meant by URI? Could you provide an example?

Also, does "dataset" refer to a BigQuery dataset or table?

ivanmkc (Contributor, Author) commented Nov 4, 2020:

It should be a BigQuery table, let me make it more explicit.

It's one of these, what's our term for this?

bq://bigquery-public-data.austin_311.table123

Reviewer:

Hrm. I've never seen a BQ URI like that before.

@shollyman Does this look familiar to you? Are there other places we use the bq:// pseudo-protocol?

ivanmkc (Contributor, Author) commented Nov 5, 2020:

Spoke with the MBSDK team and they said the format I provided is quite common. Have you seen another version?

Reviewer:

BQ itself doesn't use this prefix format anywhere I'm aware of. Storage resources have multiple formats: TableReference messages, OP-style string names (projects/p/datasets/d/tables/t), and SQL identifiers (p.d.t), depending on where they're consumed.

I can't find any reference to it in BQ code, but there's lots of references from aiplatform things, and related things in the ML space (kaggle).
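For illustration, stripping the bq:// pseudo-protocol yields the SQL-identifier form mentioned above; a sketch (the function name is hypothetical, and it assumes the project.dataset.table layout shown earlier in this thread):

```python
def bq_uri_to_sql_identifier(uri: str) -> str:
    """Convert a 'bq://project.dataset.table' URI to 'project.dataset.table'."""
    prefix = "bq://"
    if not uri.startswith(prefix):
        raise ValueError(f"Expected a {prefix} URI, got: {uri}")
    return uri[len(prefix):]
```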

@ivanmkc ivanmkc changed the title [WIP] BQ import support Added BQ import support during dataset creation Nov 4, 2020
@ivanmkc ivanmkc changed the title Added BQ import support during dataset creation Added tabular data support from BQ and GCS during dataset creation Nov 4, 2020
Comment thread google/cloud/aiplatform/datasets.py Outdated
"input_config": {"bigquery_source": {"uri": bq_source}}
}

# TODO: Remove this and let the error propagate up from downstream?
ivanmkc (Contributor, Author):

I'm about to remove this pending further discussion on validation. See the PR description. Thoughts?

Member:

I think it's safe to keep.

Comment thread google/cloud/aiplatform/datasets.py Outdated
is defined as an OpenAPI 3.0.2 Schema Object. The schema files
that can be used here are found in gs://google-cloud-
aiplatform/schema/dataset/metadata/.
TODO: Add dataset_metadata info
ivanmkc (Contributor, Author):

Do I just write this docstring or do I loop in a TW (technical writer)?

@ivanmkc ivanmkc force-pushed the imkc--bq-import branch 2 times, most recently from fd3e7eb to d3c18b0 Compare November 4, 2020 23:21
display_name=_TEST_DISPLAY_NAME,
metadata_schema_uri=_TEST_METADATA_SCHEMA_URI_NONTABULAR,
labels=_TEST_LABEL,
metadata={},
ivanmkc (Contributor, Author):

Note the lack of metadata here.

- metadata_schema_uri=_TEST_METADATA_SCHEMA_URI,
+ metadata_schema_uri=_TEST_METADATA_SCHEMA_URI_TABULAR,
labels=_TEST_LABEL,
metadata=_TEST_METADATA_TABULAR_BQ,
ivanmkc (Contributor, Author):

Note that creating a tabular dataset sets the metadata field.

sasha-gitg (Member) left a comment:

Minor comments. Looks good!

@@ -23,6 +23,7 @@
from google.cloud.aiplatform import base
Member:

View this conversation on imports: #42 (comment)

ivanmkc (Contributor, Author):

Thanks for the tip. Don't think there is anything to do here.

gapic_dataset = GapicDataset(
display_name=display_name,
metadata_schema_uri=metadata_schema_uri,
metadata=dataset_metadata,
Member:

@dizcology Why don't we have to use json_format.ParseDict here?

ivanmkc (Contributor, Author) commented Nov 5, 2020:

As to your other question, it seems to work fine without json_format.ParseDict.

ivanmkc (Contributor, Author):

By "works fine", I mean a dataset is created correctly in the cloud.
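A likely explanation, offered as an assumption rather than something verified against the generated client: proto-plus message wrappers coerce plain Python dicts into Struct-backed fields on assignment, so the explicit conversion is only needed with raw protobuf messages. With raw protobuf it would look like:

```python
# Explicit dict-to-Value conversion with raw protobuf messages; with
# proto-plus wrappers this step appears to be handled automatically.
from google.protobuf import json_format, struct_pb2

metadata_dict = {"input_config": {"bigquery_source": {"uri": "bq://p.d.t"}}}
value = json_format.ParseDict(metadata_dict, struct_pb2.Value())
# Struct supports dict-style access via the well-known-types mixin.
uri = value.struct_value["input_config"]["bigquery_source"]["uri"]
```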


@ivanmkc ivanmkc changed the title Added tabular data support from BQ and GCS during dataset creation feat: Added tabular data support from BQ and GCS during dataset creation Nov 6, 2020
@ivanmkc ivanmkc merged commit 5b098a4 into googleapis:dev Nov 6, 2020
dizcology pushed a commit to dizcology/python-aiplatform that referenced this pull request Nov 30, 2020
…ion (googleapis#40)

* feat: Add tabular import support from GCS and BQ

* fix: Ran linter again with following commands

pip3 install flake8 black==19.10b0
black docs google tests noxfile.py setup.py
flake8 google tests

* fix: Moved instantiation after validation and renamed a _import_gcs function

Co-authored-by: Ivan Cheung <ivanmkc@google.com>
dizcology pushed a commit to dizcology/python-aiplatform that referenced this pull request Dec 22, 2020
…ion (googleapis#40)

* feat: Add tabular import support from GCS and BQ

* fix: Ran linter again with following commands

pip3 install flake8 black==19.10b0
black docs google tests noxfile.py setup.py
flake8 google tests

* fix: Moved instantiation after validation and renamed a _import_gcs function

Co-authored-by: Ivan Cheung <ivanmkc@google.com>