[gpu] Install NVIDIA Container Toolkit #1025 (PR #1067)

Open · wants to merge 1 commit into master
Conversation

@cjac (Contributor) commented Jun 30, 2023

Continuation of #1025

@cjac (Contributor, Author) commented Jun 30, 2023

TO DO:

  • ensure that the DOCKER optional component configures everything so that it can be used with YARN containers launched via nvidia-docker from the NVIDIA Container Toolkit (see the job-configuration sketch after this list)
  • Develop a working example of creating a cluster using the DOCKER optional component that is able to launch PySpark jobs which complete successfully:
    Create the cluster
  time gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --optional-components DOCKER \
    --scopes 'https://www.googleapis.com/auth/cloud-platform'

Launch the job

gsutil cp test.py gs://${BUCKET}/
gcloud dataproc jobs submit pyspark \
  --properties="spark.executor.resource.gpu.amount=1,spark.task.resource.gpu.amount=1,spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YARN_DOCKER_IMAGE}" \
  --cluster=${CLUSTER_NAME} \
  --region=${REGION} \
  gs://${BUCKET}/test.py

test.py:

# Copyright 2022,2023 Google LLC and contributors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("torch - tensorflow").getOrCreate()

# Report CUDA availability as seen by PyTorch.
import torch
print("get CUDA details : == : ")
use_cuda = torch.cuda.is_available()
if use_cuda:
    print('__CUDNN VERSION:', torch.backends.cudnn.version())
    print('__Number CUDA Devices:', torch.cuda.device_count())
    print('__CUDA Device Name:', torch.cuda.get_device_name(0))
    print('__CUDA Device Total Memory [GB]:', torch.cuda.get_device_properties(0).total_memory / 1e9)

# Report GPU availability as seen by TensorFlow.
import tensorflow as tf
print("Get GPU Details : ")
print(tf.test.is_gpu_available())

if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    print("Please install GPU version of TF")

gpu_available = tf.test.is_gpu_available()
print("gpu_available : " + str(gpu_available))

is_cuda_gpu_available = tf.test.is_gpu_available(cuda_only=True)
print("is_cuda_gpu_available : " + str(is_cuda_gpu_available))

# Require a CUDA GPU with compute capability of at least 3.0.
is_cuda_gpu_min_3 = tf.test.is_gpu_available(True, (3, 0))
print("is_cuda_gpu_min_3 : " + str(is_cuda_gpu_min_3))

from tensorflow.python.client import device_lib

def get_available_gpus():
    # List the devices visible to TensorFlow and keep only the GPUs.
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']


print("Run GPU Functions Below : ")
print(get_available_gpus())
  • patch install_gpu_driver.sh to check whether the DOCKER optional component has been enabled and, if so, trigger the installation and testing of the NVIDIA Container Toolkit (see the installer sketch after this list)
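
For the first item, the job-level wiring I have in mind looks roughly like the following. This is a minimal sketch, not what this PR implements: it assumes the cluster was created with GPU accelerators attached (e.g. --worker-accelerator type=nvidia-tesla-t4,count=1) and that the NodeManagers already permit the docker runtime. YARN_CONTAINER_RUNTIME_TYPE and YARN_CONTAINER_RUNTIME_DOCKER_IMAGE are the standard Hadoop YARN environment variables, and ${YARN_DOCKER_IMAGE} is a placeholder for a CUDA-enabled image.

# Sketch only: route both the application master and the executors into
# Docker containers on YARN; assumes NodeManager docker support is enabled.
gcloud dataproc jobs submit pyspark \
  --cluster=${CLUSTER_NAME} \
  --region=${REGION} \
  --properties="\
spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker,\
spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YARN_DOCKER_IMAGE},\
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker,\
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YARN_DOCKER_IMAGE}" \
  gs://${BUCKET}/test.py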
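
For the last item, the shape of the install_gpu_driver.sh patch could be something like the sketch below. It is not this PR's actual diff: it treats a running Docker daemon as the signal that the DOCKER optional component is enabled, follows NVIDIA's documented apt repository setup for Debian-based images, and reuses ${YARN_DOCKER_IMAGE} as a stand-in test image.

# Sketch: install and smoke-test the NVIDIA Container Toolkit when the
# DOCKER optional component (i.e. a running dockerd) is present.
if systemctl is-active --quiet docker ; then
  curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
    | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
    | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
    | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  apt-get update -qq
  apt-get install -y -qq nvidia-container-toolkit
  nvidia-ctk runtime configure --runtime=docker   # register the nvidia runtime with dockerd
  systemctl restart docker
  # The toolkit works if nvidia-smi succeeds inside a container.
  docker run --rm --runtime=nvidia ${YARN_DOCKER_IMAGE} nvidia-smi
fi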

@cjac (Contributor, Author) commented Jun 30, 2023

hmmm... maybe we should patch in a metadata argument to the installer: driver-only.
If set to true, do not install CUDA or any of the other analytics infrastructure on the worker itself; these will be assumed to be installed in the container in which the workload executes.

Clusters created with this argument will not be able to perform hardware-assisted workloads directly on the host. Jobs which expect hardware-assisted workloads will need to install the libraries manually themselves or, better yet, execute in a container.
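
A sketch of what honoring that metadata argument could look like in the installer; /usr/share/google/get_metadata_value is the metadata helper shipped on Dataproc images, and the two install_* functions are hypothetical stand-ins for the existing driver and CUDA installation steps:

# Sketch: skip the CUDA/analytics stack on the host when driver-only=true.
DRIVER_ONLY="$(/usr/share/google/get_metadata_value attributes/driver-only || echo 'false')"
install_nvidia_driver                    # hypothetical: kernel driver only
if [[ "${DRIVER_ONLY}" != "true" ]]; then
  install_cuda_toolkit                   # hypothetical: CUDA + analytics libraries
fi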

@jayadeep-jayaraman (Collaborator) commented

/gcbrun
