Installation Guide
Prerequisites
Python >= 3.10 and <= 3.12
Git (for source installation)
uv (recommended package installer)
uv can be installed in either of the following ways:
# Using curl
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or using pip
pip install uv
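The prerequisites above can be checked before installing. The following is a minimal sketch, not part of Data-Juicer itself: the helper names `python_supported` and `missing_tools` are hypothetical, and it assumes the upper bound "<= 3.12" covers every 3.12.x patch release.

```python
import shutil
import sys

# Supported range as stated in the prerequisites: Python >= 3.10 and <= 3.12.
# Assumption: the bounds are compared on (major, minor) only.
MIN_PYTHON = (3, 10)
MAX_PYTHON = (3, 12)


def python_supported(version=None):
    """Return True if the (major, minor) version is inside the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PYTHON <= major_minor <= MAX_PYTHON


def missing_tools(tools=("git", "uv")):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


print("Python version supported:", python_supported())
print("Missing tools:", missing_tools() or "none")
```

Git is only needed for source installation, so a missing `git` is not a problem for a plain PyPI install.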
Basic DJ Installation
Data-Juicer is now available on PyPI. The minimal installation includes core data processing capabilities:
uv pip install py-data-juicer
This provides:
Data loading and manipulation
File system operations
Parallel processing
Basic I/O and utilities
Scenario-based Installation
For component details, please refer to pyproject.toml.
Core ML & DL
# Generic ML/DL capabilities
uv pip install "py-data-juicer[generic]"
Includes: PyTorch, Transformers, VLLM, etc.
Domain-Specific Features
# Computer Vision
uv pip install "py-data-juicer[vision]"
# Natural Language Processing
uv pip install "py-data-juicer[nlp]"
# Audio Processing
uv pip install "py-data-juicer[audio]"
Additional Components
# Distributed Computing
uv pip install "py-data-juicer[distributed]"
# AI Services & APIs
uv pip install "py-data-juicer[ai_services]"
Development Tools
# Development & Testing
uv pip install "py-data-juicer[dev]"
Common Installation Patterns
1. Text Processing Setup
uv pip install "py-data-juicer[generic,nlp]"
2. Vision Processing Setup
uv pip install "py-data-juicer[generic,vision]"
3. Full Processing Pipeline
uv pip install "py-data-juicer[generic,nlp,vision,distributed]"
4. Complete Installation
# Install all features (except sandbox)
uv pip install "py-data-juicer[all]"
5. For Development Mode
For contributors and developers:
# Clone repository
git clone https://github.com/modelscope/data-juicer.git
cd data-juicer
# Install dev dependencies
uv pip install -e ".[dev]"
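The patterns above all follow the same shape: the package name followed by a comma-separated list of extras in square brackets. As an illustrative sketch (the `requirement` helper and the `KNOWN_EXTRAS` set are hypothetical and only list the extras named in this guide; pyproject.toml remains the authoritative source), the requirement string can be composed like this:

```python
# Extras named in this guide; check pyproject.toml for the authoritative list.
KNOWN_EXTRAS = {
    "generic", "vision", "nlp", "audio",
    "distributed", "ai_services", "dev", "all",
}


def requirement(extras=(), package="py-data-juicer"):
    """Build a requirement string such as 'py-data-juicer[generic,nlp]'."""
    unknown = [extra for extra in extras if extra not in KNOWN_EXTRAS]
    if unknown:
        raise ValueError(f"extras not listed in this guide: {unknown}")
    if not extras:
        # No extras requested: this is the minimal installation.
        return package
    return f"{package}[{','.join(extras)}]"


# Print the install command for the text processing setup.
print(f'uv pip install "{requirement(["generic", "nlp"])}"')
```

Quoting the requirement on the command line, as the patterns above do, keeps shells such as zsh from interpreting the square brackets.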
Installation for Specific OPs
Besides scenario-based installation, we also provide OP-based and recipe-based installation options.
Install dependencies for specific OPs
As the number of OPs grows, the combined dependencies of all OPs have become very heavy. Instead of installing every dependency with the command pip install -v -e .[all], we provide two lighter alternatives:
Automatic Minimal Dependency Installation: During the execution of Data-Juicer, minimal dependencies are installed automatically. This allows for immediate execution, but may lead to dependency conflicts.
Manual Minimal Dependency Installation: To manually install minimal dependencies tailored to a specific execution configuration, run the following command:
# only for installation from source
python tools/dj_install.py --config path_to_your_data-juicer_config_file
# use command line tool
dj-install --config path_to_your_data-juicer_config_file
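When the manual installation step is driven from a script, the two invocations above can be assembled programmatically. This is a sketch under assumptions: the `dj_install_command` helper is hypothetical, and only the `--config` option shown above is used.

```python
import sys


def dj_install_command(config_path, from_source=False):
    """Return the argument list for the minimal dependency installer."""
    if from_source:
        # Only valid when run from the root of a source checkout.
        return [sys.executable, "tools/dj_install.py", "--config", config_path]
    # Command line tool, as shown above.
    return ["dj-install", "--config", config_path]


print(" ".join(dj_install_command("path_to_your_data-juicer_config_file")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` so that a failed installation raises an error instead of passing silently.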
Installation Using Docker
You can either pull our pre-built image from DockerHub:
docker pull datajuicer/data-juicer:<version_tag>
If you cannot connect to DockerHub, please use another registry mirror (you can find some on the Internet):
docker pull <other_registry_mirror>/datajuicer/data-juicer:<version_tag>
or run the following command to build the Docker image, including the latest data-juicer, with the provided Dockerfile:
docker build -t datajuicer/data-juicer:<version_tag> .
The format of <version_tag> is like v0.2.0, which is the same as the release version tag.
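A malformed tag is an easy mistake when pulling or building the image. The sketch below validates a tag and builds the image reference; it assumes that every release tag has the shape of the v0.2.0 example ("v" followed by three dot-separated numbers), which this guide does not guarantee, and the `image_reference` helper is hypothetical.

```python
import re

# Assumed tag shape, based on the v0.2.0 example: "v" + MAJOR.MINOR.PATCH.
TAG_PATTERN = re.compile(r"v\d+\.\d+\.\d+")


def image_reference(version_tag, mirror=None):
    """Return the image reference, optionally prefixed with a registry mirror."""
    if not TAG_PATTERN.fullmatch(version_tag):
        raise ValueError(f"unexpected version tag: {version_tag!r}")
    name = f"datajuicer/data-juicer:{version_tag}"
    return f"{mirror}/{name}" if mirror else name


print("docker pull", image_reference("v0.2.0"))
```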
Notes & Troubleshooting
Installation Check
import data_juicer as dj
print(dj.__version__)
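If importing the package fails, it can help to check whether the distribution is installed at all. The following sketch queries the installed package metadata instead of importing the package; the `installed_version` helper is hypothetical, and the distribution name `py-data-juicer` is taken from the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="py-data-juicer"):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("py-data-juicer is not installed in this environment")
else:
    print("py-data-juicer version:", found)
```

A `None` result usually means the package was installed into a different environment than the one running the check.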
Modular Installation
Install only what you need
Combine components as required
Use all for complete installation
Sandbox Environment
Separate installation for experimental features
Will be provided as micro-services in the future
For Video-related Operators
Before using video-related operators, FFmpeg should be installed and accessible via the $PATH environment variable.
You can install FFmpeg using a package manager (e.g. sudo apt install ffmpeg on Debian/Ubuntu, brew install ffmpeg on macOS) or download it from the official FFmpeg website.
Check if your environment path is set correctly by running the ffmpeg command from the terminal.
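The same check can be done from Python, which is useful inside the environment that will run the video operators. This is a minimal sketch with a hypothetical helper name, `ffmpeg_version_line`; it looks the executable up on PATH and, if found, runs `ffmpeg -version`.

```python
import shutil
import subprocess


def ffmpeg_version_line(executable="ffmpeg"):
    """Return the first line of `<executable> -version`, or None if unavailable."""
    path = shutil.which(executable)
    if path is None:
        # Not reachable through the PATH environment variable.
        return None
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=False
    )
    return result.stdout.splitlines()[0] if result.stdout else None


line = ffmpeg_version_line()
print(line if line else "FFmpeg was not found on PATH")
```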
Getting Help
Please check the documentation and existing issues first
Create GitHub issues when necessary
Join community channels for discussions