seqio_cache_tasks fails on DataflowRunner #109

Open
bzz opened this issue Sep 19, 2021 · 3 comments
bzz commented Sep 19, 2021

When trying to cache a dataset that is too large for the DirectRunner (e.g. google-research/text-to-text-transfer-transformer#323 (comment)) on Cloud Dataflow, without any requirements.txt, like

python -m seqio.scripts.cache_tasks_main \
 --module_import="..." \
 --tasks="${TASK_NAME}" \
 --output_cache_dir="${BUCKET}/cache" \
 --alsologtostderr \
 --pipeline_options="--runner=DataflowRunner,--project=$PROJECT,--region=$REGION,--job_name=$TASK_NAME,--staging_location=$BUCKET/binaries,--temp_location=$BUCKET/tmp,--experiments=shuffle_mode=appliance"

it fails with ModuleNotFoundError: No module named 'seqio'.

If seqio is added with

echo seqio > /tmp/beam_requirements.txt

# and run the same, adding to `--pipeline_options`
--requirements_file=/tmp/beam_requirements.txt

it fails with

subprocess.CalledProcessError: Command '['.../.venv/bin/python', '-m', 'pip', 'download', '--dest', '..../pip-tmp/dataflow-requirements-cache', '-r', '/tmp/beam_requirements.txt', '--exists-action', 'i', '--no-binary', ':all:']' returned non-zero exit status 1.

 Pip install failed for package: -r
 Output from execution of subprocess: b"ERROR: Could not find a version that satisfies the requirement tensorflow-text (from versions: none)
ERROR: No matching distribution found for tensorflow-text

This seems to be caused by seqio depending on tensorflow-text, which does not publish any source release artifacts.

But the requirements cache in Apache Beam seems to be populated with --no-binary :all: before it is made available to the workers.

Trying the same in a clean venv reproduces the error:

pip3 install  --no-binary :all: --no-deps tensorflow-text==2.6.0
ERROR: Could not find a version that satisfies the requirement tensorflow-text==2.6.0 (from versions: none)
ERROR: No matching distribution found for tensorflow-text==2.6.0

Am I doing something wrong, or how does everyone else work around this? I'd appreciate a hand here.

bzz commented Sep 22, 2021

In case anyone else stumbles upon this or lands here through search: the kind people in the Apache Beam community pointed me to https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#nonpython

Adding pip install seqio as a custom command in setup.py and passing it through --setup_file=$PWD/setup.py did the trick.
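For reference, here is a minimal sketch of that setup.py, following the custom-commands pattern from the Beam docs linked above. The package name and version are placeholders; adapt the CUSTOM_COMMANDS list to your environment.

```python
# setup.py -- sketch of Beam's custom-commands pattern for non-pip-friendly deps.
import subprocess

import setuptools
from distutils.command.build import build as _build  # setuptools shims distutils

# Commands run on each Dataflow worker at staging time. Installing seqio here
# sidesteps pip's source-only (--no-binary :all:) requirements cache, which
# cannot build tensorflow-text from source.
CUSTOM_COMMANDS = [
    ['pip', 'install', 'seqio'],
]


class build(_build):
  """Chains the custom-command step into the normal build."""
  sub_commands = _build.sub_commands + [('CustomCommands', None)]


class CustomCommands(setuptools.Command):
  """Runs each entry of CUSTOM_COMMANDS in a subprocess."""
  user_options = []

  def initialize_options(self):
    pass

  def finalize_options(self):
    pass

  def run(self):
    for command in CUSTOM_COMMANDS:
      subprocess.check_call(command)


if __name__ == '__main__':
  setuptools.setup(
      name='seqio-cache-job',  # placeholder name
      version='0.0.1',
      packages=setuptools.find_packages(),
      cmdclass={'build': build, 'CustomCommands': CustomCommands},
  )
```

Then add --setup_file=$PWD/setup.py to the pipeline options (instead of --requirements_file), and Dataflow runs the custom commands while staging the workers.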

I'd be happy to submit a docs patch with instructions if someone points me to the right place to put it.

adarob commented Nov 8, 2021

@bzz please do add these details to the README

@adarob adarob reopened this Nov 8, 2021
@marcospiau

Just to add another possible solution: after some time trying, what finally worked for me was a combination of a setup.py and a custom Docker image for the Dataflow workers (https://cloud.google.com/dataflow/docs/guides/using-custom-containers). The setup.py is used only to package the code needed for the task and preprocessor definitions; the other requirements (including seqio and t5) can be specified in the Dockerfile.
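A sketch of such a Dockerfile, based on the custom-container guide linked above (the Beam SDK image tag and the package list are illustrative; pick a tag matching your pipeline's Beam SDK version):

```dockerfile
# Base image must match the Beam SDK version used to submit the pipeline.
FROM apache/beam_python3.9_sdk:2.41.0

# Dependencies that have no source distribution (e.g. tensorflow-text, pulled
# in by seqio) install fine here, because Docker uses the prebuilt wheels.
RUN pip install --no-cache-dir seqio t5

# Start the Beam SDK harness (same entrypoint as the base image).
ENTRYPOINT ["/opt/apache/beam/boot"]
```

The image is then built, pushed to a registry, and passed to Dataflow via the sdk_container_image pipeline option.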
