feat: Added tabular data support from BQ and GCS during dataset creation (#40)
Conversation
my_dataset = Dataset.create(
    display_name=_TEST_DISPLAY_NAME,
-   source=_TEST_SOURCE_URI,
+   gcs_source=_TEST_SOURCE_URI_GCS,
Renaming for clarity
@pytest.mark.usefixtures("get_dataset_mock")
- def test_create_dataset(self, create_dataset_mock):
+ def test_create_dataset_nontabular(self, create_dataset_mock):
A non-tabular dataset doesn't need metadata set for it.
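For context, a minimal sketch of the create call this renamed test exercises (names are taken from the surrounding diffs; the exact argument list is an assumption):

    # Sketch only: a non-tabular dataset is created without tabular metadata.
    my_dataset = Dataset.create(
        display_name=_TEST_DISPLAY_NAME,
        metadata_schema_uri=_TEST_METADATA_SCHEMA_URI_NONTABULAR,
        gcs_source=_TEST_SOURCE_URI_GCS,
    )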
parent=_TEST_PARENT, dataset=expected_dataset, metadata=()
)

# TODO: test_create_dataset_nontabular_with_bq_source <- should raise a value error
Will address this TODO later in this PR.
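A hedged sketch of what that TODO test could look like (the test name comes from the TODO comment; the fixture and constant names are assumptions borrowed from the surrounding tests):

    import pytest

    @pytest.mark.usefixtures("get_dataset_mock")
    def test_create_dataset_nontabular_with_bq_source(self, create_dataset_mock):
        # A BQ source combined with a non-tabular schema should be rejected.
        with pytest.raises(ValueError):
            Dataset.create(
                display_name=_TEST_DISPLAY_NAME,
                metadata_schema_uri=_TEST_METADATA_SCHEMA_URI_NONTABULAR,
                bq_source=_TEST_SOURCE_URI_BQ,
            )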
| "gs://my-bucket/index_file_2.jsonl", | ||
| "gs://my-bucket/index_file_3.jsonl", | ||
| ] | ||
| _TEST_SOURCE_URI_BQ = "bigquery://my-project/my-dataset" |
Separated BQ and GCS sources for testing
@@ -0,0 +1,9 @@
class training_job:
Added schema from Sasha's branch: https://github.com/sasha-gitg/python-aiplatform/blob/custom_training/google/cloud/aiplatform/schema.py
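For reference, a rough sketch of the shape of that schema module (class names follow the diff above; the attribute and URI are assumptions, not verified against the linked branch):

    class training_job:
        class definition:
            # Hedged sketch; the exact schema URI may differ.
            custom_task = (
                "gs://google-cloud-aiplatform/schema/"
                "trainingjob/definition/custom_task_1.0.0.yaml"
            )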
An object representing a long-running operation.
"""
# TODO(b/171311614): Add support for BigQuery import source
# Should throw error if BQ source is received
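A minimal sketch of the validation that TODO calls for, assuming a bq_source parameter is in scope at this point (the message wording is an assumption):

    # Non-tabular import only supports GCS files, so reject a BQ source here.
    if bq_source:
        raise ValueError(
            "BigQuery import is only supported for tabular datasets at creation time."
        )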
)

- def _import(
+ def _import_gcs(
Made it explicit that this only works with GCS.
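A hedged sketch of the renamed method's signature (the import_schema_uri parameter and return type are assumptions drawn from the nearby diffs):

    from typing import Sequence

    def _import_gcs(
        self,
        gcs_source: Sequence[str],
        import_schema_uri: str,
    ) -> "Dataset":
        # Runs an import job that reads index files from GCS only.
        ...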
labels=labels,
)

if dataset_metadata is None:
This was used for testing and will be removed.
Force-pushed 79fca16 to 1d521ee
parent: str,
display_name: str,
metadata_schema_uri: str,
dataset_metadata: Dict,
Made this explicitly non-optional, as it's required by GapicDataset.
Is it possible dataset_metadata can have a stricter type signature?
Dict[str, Any] or Dict[str, Dict]?
@sasha-gitg I'm checking the proto for more guidance on the exact data structure, but it just says it's a struct.
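Concretely, for a BQ-backed tabular dataset the struct ends up looking like the snippet below (mirroring the input_config dict quoted later in this PR), which is why Dict[str, Any] would be a reasonable tightening:

    bq_source = "bq://bigquery-public-data.austin_311.table123"  # example from this thread
    dataset_metadata = {
        "input_config": {"bigquery_source": {"uri": bq_source}}
    }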
# If an import source was not provided, return empty created Dataset.
- if not source:
+ if gcs_source and not is_tabular_dataset_metadata:
Tabular datasets must be imported at creation time.
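A hedged sketch of the branching this implies (the condition is copied from the diff above; the body is an assumption):

    if gcs_source and not is_tabular_dataset_metadata:
        # Non-tabular data can be imported after the dataset exists.
        dataset = dataset._import_gcs(
            gcs_source=gcs_source, import_schema_uri=import_schema_uri
        )
    # Tabular sources, by contrast, are baked into the dataset metadata at create time.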
api_client = cls._instantiate_client(location=location, credentials=credentials)

# If this is tabular, enrich the dataset metadata with the source
# TODO: Use interfaces to abstract away gcs and bq specific logic
May experiment with this in a later PR.
that can be used here are found in gs://google-cloud-
aiplatform/schema/dataset/metadata/.
- source: Optional[Sequence[str]]=None:
+ gcs_source: Optional[Sequence[str]]=None:
gcs_source seems to be an array of URIs, while bq_source seems to be a single URI.
Combining them into one parameter seems messy (even if both were the same datatype), so I split them.
By messy, I mean it'll be unclear what datatype to use and there will have to be logic to validate BQ and GCS types. It seems cleaner to use the type system to differentiate them. I can combine them into a single source parameter again once I refactor it to use a single interface.
Also, there's specific logic that changes depending on bq_source and gcs_source, which I feel should really be abstracted out to decouple Dataset from GCS and BQ, as well as allow future extensibility. I'll put a PR together later for discussion. A sketch of the resulting signature follows.
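Under this split, the create signature would look roughly like this (types taken from the docstring diffs; the other parameters are elided and the exact ordering is an assumption):

    from typing import Optional, Sequence

    @classmethod
    def create(
        cls,
        display_name: str,
        metadata_schema_uri: str,
        gcs_source: Optional[Sequence[str]] = None,  # one or more GCS URIs
        bq_source: Optional[str] = None,             # a single BQ URI
        # ...other parameters elided...
    ) -> "Dataset":
        ...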
Force-pushed 1d521ee to 4eae0ec
information on wildcards, see
https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames.
bq_source: Optional[str]=None:
    BigQuery URI to the input dataset.
What's meant by URI? Could you provide an example?
Also, does "dataset" refer to a BigQuery dataset or table?
It should be a BigQuery table; let me make it more explicit.
It's one of these, what's our term for this?
bq://bigquery-public-data.austin_311.table123
Hrm. I've never seen a BQ URI like that before.
@shollyman Does this look familiar to you? Are there other places we use the bq:// pseudo-protocol?
Spoke with the MBSDK team and they said the format I provided is quite common. Have you seen another version?
BQ itself doesn't use this prefix format in any place I'm aware of. Storage resources have multiple formats: TableReference, one-platform style resource names (projects/p/datasets/d/tables/t), and SQL identifiers (p.d.t), depending on where they're being consumed.
I can't find any reference to it in BQ code, but there are lots of references from aiplatform and related things in the ML space (Kaggle).
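Side by side, the spellings mentioned above (values illustrative, not real resources):

    table_reference = {"projectId": "p", "datasetId": "d", "tableId": "t"}  # TableReference
    resource_name = "projects/p/datasets/d/tables/t"                        # one-platform resource name
    sql_identifier = "p.d.t"                                                # SQL identifier
    aiplatform_uri = "bq://p.d.t"                                           # bq:// pseudo-protocol used here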
| "input_config": {"bigquery_source": {"uri": bq_source}} | ||
| } | ||
|
|
||
| # TODO: Remove this and let the error propagate from up from downstream? |
I'm about to remove this until there's more discussion regarding the validation. See the PR description. Thoughts?
is defined as an OpenAPI 3.0.2 Schema Object. The schema files
that can be used here are found in gs://google-cloud-
aiplatform/schema/dataset/metadata/.
TODO: Add dataset_metadata info
Do I just write this docstring myself, or do I loop in a TW?
Force-pushed fd3e7eb to d3c18b0
display_name=_TEST_DISPLAY_NAME,
metadata_schema_uri=_TEST_METADATA_SCHEMA_URI_NONTABULAR,
labels=_TEST_LABEL,
metadata={},
Note the lack of metadata here.
- metadata_schema_uri=_TEST_METADATA_SCHEMA_URI,
+ metadata_schema_uri=_TEST_METADATA_SCHEMA_URI_TABULAR,
labels=_TEST_LABEL,
metadata=_TEST_METADATA_TABULAR_BQ,
Note that creating a tabular dataset sets the metadata field.
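The test constant presumably mirrors the input_config structure quoted earlier in this PR (an assumption; the constant's exact value isn't shown in this diff):

    _TEST_METADATA_TABULAR_BQ = {
        "input_config": {"bigquery_source": {"uri": _TEST_SOURCE_URI_BQ}}
    }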
Force-pushed d3c18b0 to 7c43fe3
sasha-gitg left a comment:
Minor comments. Looks good!
@@ -23,6 +23,7 @@
from google.cloud.aiplatform import base
Thanks for the tip. Don't think there is anything to do here.
| "input_config": {"bigquery_source": {"uri": bq_source}} | ||
| } | ||
|
|
||
| # TODO: Remove this and let the error propagate from up from downstream? |
| parent: str, | ||
| display_name: str, | ||
| metadata_schema_uri: str, | ||
| dataset_metadata: Dict, |
There was a problem hiding this comment.
Is it possible dataset_metadata can have a stricter type signature?
Dict[str, Any] or Dict[str, Dict]?
gapic_dataset = GapicDataset(
    display_name=display_name,
    metadata_schema_uri=metadata_schema_uri,
    metadata=dataset_metadata,
@dizcology Why don't we have to use json_format.ParseDict here?
As to your other question, it seems to work fine without json_format.ParseDict.
By "works fine", I mean a dataset is created correctly on the cloud.
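A plausible explanation, sketched below, is that proto-plus message constructors coerce plain Python dicts into struct-valued fields, so an explicit json_format.ParseDict is redundant. The import path and schema URI here are assumptions, not confirmed by this PR:

    # Assumed GAPIC import path for the generated Dataset type.
    from google.cloud.aiplatform_v1beta1.types import dataset as gca_dataset

    # proto-plus converts the dict into the metadata struct for us.
    gapic_dataset = gca_dataset.Dataset(
        display_name="my-dataset",
        metadata_schema_uri="gs://google-cloud-aiplatform/schema/dataset/metadata/tabular_1.0.0.yaml",
        metadata={"input_config": {"bigquery_source": {"uri": "bq://p.d.t"}}},
    )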
pip3 install flake8 black==19.10b0
black docs google tests noxfile.py setup.py
flake8 google tests
Force-pushed 89fdd84 to 4789a0d
feat: Added tabular data support from BQ and GCS during dataset creation (googleapis#40)
* feat: Add tabular import support from GCS and BQ
* fix: Ran linter again with the following commands:
  pip3 install flake8 black==19.10b0
  black docs google tests noxfile.py setup.py
  flake8 google tests
* fix: Moved instantiation after validation and renamed a _import_gcs function
Co-authored-by: Ivan Cheung <ivanmkc@google.com>
Description
This PR adds BQ import support for tabular datasets during Dataset creation.
Fixes https://b.corp.google.com/issues/171311614
Testing:
Validation
I've decided to hold off on validation for now, until I have a clearer picture of what to catch at dataset creation and when to wait for the underlying API to throw an error. Will do this in a later PR. This includes error cases like:
It's likely that I'll have to add this validation, as everything seems to work fine at the Dataset creation step but can potentially break downstream.
Details listed here: https://docs.google.com/document/d/1eyNM8GGf-JEtsZRLCm5SVXQXddvuQeODSEEsKv-mHJ4/edit?usp=sharing