Installation Guide¶
Prerequisites¶
Python >= 3.10
Git (for source installation)
uv (recommended package installer)
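The prerequisites above can be checked before installing. The following is a minimal sketch, not part of Data-Juicer itself: the function name `check_prerequisites` and its parameters are hypothetical, and it only tests whether `git` and `uv` can be found on `PATH`.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum Python version listed in the prerequisites


def check_prerequisites(version_info=None, which=shutil.which):
    """Return a dict mapping each prerequisite name to True or False.

    Hypothetical helper: `version_info` and `which` are injectable so the
    check can be exercised without touching the real environment.
    """
    if version_info is None:
        version_info = sys.version_info
    return {
        "python": tuple(version_info[:2]) >= MIN_PYTHON,
        "git": which("git") is not None,  # only needed for source installation
        "uv": which("uv") is not None,    # recommended, not required
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

A missing `uv` is not an error, since plain `pip` works for every command in this guide.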
Basic DJ Installation¶
Data-Juicer is now available on PyPI. The minimal installation includes core data processing capabilities:
pip install py-data-juicer
This provides:
Data loading and manipulation
File system operations
Parallel processing
Basic I/O and utilities
Scenario-based Installation¶
For component details, please refer to pyproject.toml.
Core ML & DL
# Generic ML/DL capabilities
pip install "py-data-juicer[generic]"
Includes: PyTorch, Transformers, vLLM, etc.
Domain-Specific Features
# Computer Vision
pip install "py-data-juicer[vision]"
# Natural Language Processing
pip install "py-data-juicer[nlp]"
# Audio Processing
pip install "py-data-juicer[audio]"
Additional Components
# Distributed Computing
pip install "py-data-juicer[distributed]"
# AI Services & APIs
pip install "py-data-juicer[ai_services]"
Development Tools
# Development & Testing
pip install "py-data-juicer[dev]"
Common Installation Patterns¶
1. Text Processing Setup
pip install "py-data-juicer[generic,nlp]"
2. Vision Processing Setup
pip install "py-data-juicer[generic,vision]"
3. Full Processing Pipeline
pip install "py-data-juicer[generic,nlp,vision,distributed]"
4. Complete Installation
# Install all features (except sandbox)
pip install "py-data-juicer[all]"
5. Development Mode
For contributors and developers:
# Clone repository
git clone https://github.com/modelscope/data-juicer.git
cd data-juicer
# Install dev dependencies
pip install -e ".[dev]"
# Optionally, use uv for venv and dependency management
curl -LsSf https://astral.sh/uv/install.sh | sh # install uv
uv venv --python 3.10 # initialize virtual env with python 3.10
source .venv/bin/activate # activate virtual env
uv pip install -e . # install minimal dependencies
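Patterns 1 to 4 above share one shape: the package name followed by a comma-separated list of extras in square brackets. As a rough illustration, the sketch below builds such a requirement string from a list of extras. The helper names `requirement` and `pip_command` are hypothetical and are not provided by Data-Juicer.

```python
import sys

PACKAGE = "py-data-juicer"  # PyPI name used throughout this guide


def requirement(extras=()):
    """Build a requirement string such as 'py-data-juicer[generic,nlp]'."""
    unique = list(dict.fromkeys(extras))  # drop duplicates, keep order
    if not unique:
        return PACKAGE  # minimal installation, no extras
    return f"{PACKAGE}[{','.join(unique)}]"


def pip_command(extras=()):
    """Return an argument list for `python -m pip install <requirement>`."""
    return [sys.executable, "-m", "pip", "install", requirement(extras)]


if __name__ == "__main__":
    # Print the command instead of running it, so nothing is installed here.
    print(" ".join(pip_command(["generic", "nlp"])))
```

Passing the command as an argument list (for example to `subprocess.run`) avoids the shell quoting that the bracket syntax otherwise requires.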
Installation for Specific OPs¶
Besides scenario-based installation, we also provide OP-based and recipe-based installation options.
Install dependencies for specific OPs
As the number of OPs grows, the combined dependencies of all OPs become very heavy. Instead of running pip install -v -e ".[all]" to install all dependencies, we provide two lighter alternatives:
Automatic Minimal Dependency Installation: While Data-Juicer runs, the minimal dependencies are installed automatically. This allows immediate execution but may lead to dependency conflicts.
Manual Minimal Dependency Installation: To manually install minimal dependencies tailored to a specific execution configuration, run the following command:
# only for installation from source
python tools/dj_install.py --config path_to_your_data-juicer_config_file
# use command line tool
dj-install --config path_to_your_data-juicer_config_file
Installation Using Docker¶
You can either pull our pre-built image from DockerHub:
docker pull datajuicer/data-juicer:<version_tag>
If you cannot connect to DockerHub, use another registry mirror (you can find some on the Internet):
docker pull <other_registry_mirror>/datajuicer/data-juicer:<version_tag>
or run the following command to build the Docker image including the latest data-juicer with the provided Dockerfile:
docker build -t datajuicer/data-juicer:<version_tag> .
The format of <version_tag> is like v0.2.0, which is the same as the release version tag.
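To show how the image name, the optional mirror prefix, and the version tag fit together, here is a small sketch. The helper `image_reference` is hypothetical, and the tag pattern `v<major>.<minor>.<patch>` is an assumption inferred only from the single example `v0.2.0`; actual release tags may differ.

```python
import re

REPOSITORY = "datajuicer/data-juicer"  # image name used in this guide
# Assumption: tags look like v<major>.<minor>.<patch>, inferred from "v0.2.0".
TAG_PATTERN = re.compile(r"v\d+\.\d+\.\d+")


def image_reference(version_tag, mirror=None):
    """Return the image reference to pass to `docker pull`.

    `mirror` is an optional registry mirror host placed before the image name.
    """
    if not TAG_PATTERN.fullmatch(version_tag):
        raise ValueError(f"unexpected version tag: {version_tag!r}")
    prefix = f"{mirror.rstrip('/')}/" if mirror else ""
    return f"{prefix}{REPOSITORY}:{version_tag}"


if __name__ == "__main__":
    print("docker pull " + image_reference("v0.2.0"))
```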
Notes & Troubleshooting¶
Installation Check
import data_juicer as dj
print(dj.__version__)
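If importing the package fails, it can help to first confirm whether the distribution is installed at all. The sketch below uses the standard-library `importlib.metadata` to look up the installed version without importing `data_juicer`. The function name `installed_version` is hypothetical, and the distribution name `py-data-juicer` is taken from the pip commands in this guide.

```python
from importlib import metadata


def installed_version(dist_name="py-data-juicer"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("py-data-juicer is not installed in this environment")
    else:
        print(f"py-data-juicer {version} is installed")
```

A `None` result usually means that pip installed into a different environment than the one running the check.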
Modular Installation
Install only what you need
Combine components as required
Use all for complete installation
Sandbox Environment
Separate installation for experimental features
Will be provided as microservices in the future
For Video-related Operators
Before using video-related operators, FFmpeg should be installed and accessible via the $PATH environment variable.
You can install FFmpeg using a package manager (e.g. sudo apt install ffmpeg on Debian/Ubuntu, brew install ffmpeg on macOS) or visit the official FFmpeg website.
Check if your environment path is set correctly by running the ffmpeg command from the terminal.
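The same check can be scripted. The following sketch looks for `ffmpeg` on `PATH` and, if found, returns the first line of `ffmpeg -version`. The function name `ffmpeg_status` is hypothetical, and the check only confirms that the executable is reachable, not that a particular codec is available.

```python
import shutil
import subprocess


def ffmpeg_status(executable="ffmpeg"):
    """Return the first line of `<executable> -version`, or None if not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    status = ffmpeg_status()
    print(status if status is not None else "ffmpeg not found on PATH")
```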
Getting Help
Please check the documentation and existing issues first
Create GitHub issues when necessary
Join community channels for discussions