Dataflow-based Sequence File Import Example

This example shows how to import HBase sequence files into Google Cloud Bigtable using Cloud Dataflow.

Project Setup

Follow the instructions here to

  • provision your project for Cloud Dataflow

  • provision your project for Cloud Bigtable

  • create a Google Cloud Storage bucket

  • create a Cloud Bigtable table (one command-line approach is sketched after this list)
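
As a rough sketch, one way to create the destination table from the command line is with the cbt tool; the project, instance, table name, and column family below are placeholders, and the column families you create must match those of the exported HBase table:

# Create the destination table and a column family (cf is only a placeholder).
cbt -project ${BIGTABLE_PROJECT} -instance ${BIGTABLE_INSTANCE} createtable ${BIGTABLE_TABLE}
cbt -project ${BIGTABLE_PROJECT} -instance ${BIGTABLE_INSTANCE} createfamily ${BIGTABLE_TABLE} cf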

Then export an HBase table as sequence files using the Hadoop or HBase export command.
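
A minimal sketch of the export step, assuming HBase's built-in Export MapReduce job and placeholder table, path, and bucket names (copying to Cloud Storage with distcp also assumes the GCS connector is configured on the cluster):

# Export the HBase table as sequence files (written to the cluster's default file system).
hbase org.apache.hadoop.hbase.mapreduce.Export my-hbase-table /tmp/my-hbase-sequence-file-folder

# Copy the exported sequence files to Google Cloud Storage so Dataflow can read them.
hadoop distcp /tmp/my-hbase-sequence-file-folder gs://mybucket/my-hbase-sequence-file-folder/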

Arguments

The user must provide the following command line arguments:

  • bigtableProjectId: ID of the Cloud Bigtable project that contains the target table.

  • bigtableInstanceId: ID of the Cloud Bigtable instance that contains the target table.

  • runner: specifies where the Dataflow job runs. If the input files are on the local file system, use "DirectPipelineRunner". If the input files are on Google Cloud Storage, "BlockingDataflowPipelineRunner" launches the job on the Cloud Dataflow service and waits for it to complete.

  • project: ID of the Google Cloud project in which to run the Dataflow job. If not provided, the application uses the bigtableProjectId.

  • stagingLocation: the Google Cloud Storage location to which the Dataflow job's staging files are uploaded.

  • filePattern: the location of the input file(s). This may be either the path to a directory (e.g., gs://mybucket/my-hbase-sequence-file-folder/) or a file pattern (e.g., /tmp/part-m*).

  • HBase094DataFormat: optional argument. If the input files were exported from HBase 0.94 or earlier, set this argument to 'true' so that the correct deserializer is chosen. The default is 'false'.
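
For reference, the Maven commands in the next section read their values from shell variables; a minimal setup with placeholder values might look like this:

export DATAFLOW_PROJECT=my-dataflow-project        # project that runs the Dataflow job
export BIGTABLE_PROJECT=my-bigtable-project        # project that owns the target table
export BIGTABLE_INSTANCE=my-bigtable-instance
export BIGTABLE_TABLE=my-table
export GCS_BUCKET=gs://mybucket                    # used for staging (and input, if on GCS)
export INPUT_FILE_PATTERN=gs://mybucket/my-hbase-sequence-file-folder/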

Running the tool

Arguments may be hardcoded in pom.xml or supplied as overriding properties on the Maven command line. The following commands supply all arguments from the command line:

mvn package
mvn exec:exec -DImportByDataflow -Ddataflow.project=${DATAFLOW_PROJECT} \
    -Dbigtable.projectID=${BIGTABLE_PROJECT} -Dbigtable.instanceID=${BIGTABLE_INSTANCE} \
    -Dgs=${GCS_BUCKET} \
    -Ddataflow.staging.location=${GCS_BUCKET}/import-examples/staging \
    -Ddataflow.runner=DirectPipelineRunner -Dbigtable.table=${BIGTABLE_TABLE} \
    -Dfile.pattern=${INPUT_FILE_PATTERN}

The following commands supply the runner, Bigtable table name, and input file location from the command line, assuming the other properties are hardcoded in pom.xml:

mvn package
mvn exec:exec -DImportByDataflow -Ddataflow.runner=DirectPipelineRunner \
    -Dbigtable.table=${BIGTABLE_TABLE} -Dfile.pattern=${INPUT_FILE_PATTERN}
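
As noted in the Arguments section, when the input files are on Google Cloud Storage the job can be submitted to the Cloud Dataflow service by switching the runner; a sketch, with a placeholder file pattern:

mvn exec:exec -DImportByDataflow -Ddataflow.runner=BlockingDataflowPipelineRunner \
    -Dbigtable.table=${BIGTABLE_TABLE} \
    -Dfile.pattern=gs://mybucket/my-hbase-sequence-file-folder/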