分布式机器学习dask

xiaoxiao2022-07-14 203

文章目录

Dask组成特性Install DaskSetupuse case 分布式机器学习

Dask

Dask是一个数据分析的并行计算的框架。

已经集成了现有的框架，比如：numpy，pandas，scikit-learn，xgboost，lightGBM等

API与已有框架的API一致

可以扩展到上千个节点，也可以在笔记本上使用

有低阶API可供用户定制化

组成

动态任务调度（Dynamic task scheduling）优化交互计算的工作量，与Airflow，Luigi，Celery或Make类似“大数据”集合（Big Data Collection）扩展了NumPy，pandas，Python iterators可以处理比内存大的数据及在分布式的环境上

特性

Familiar：数据结构一致

Flexible：提供了一个任务调度接口，可以定制化的集成其它算法

Native：纯python环境

Fast：减少了工作量，增加了并行计算，速度更快

Scales up：可以扩展到1000+cores

Scales down：可以在laptop上使用

Responsive：有诊断系统，反馈更及时

有一个Task Graph，与spark的类似

Install Dask

conda安装完全安装，包含了所有的信赖，比如numpy,pandas conda install dask只安装内核 conda install dask-core 与pip install dask一样 pip安装 pip install “dask[compete]” # Install everything pip install dask # Install only corecluster 部署 # 安装dask 1.2.2 conda install dask==1.2.2 或者 pip install dask[complete]==1.2.2 # 启动scheduler进程，并挂后台 nohup dask-scheduler --host 172.16.36.20 & # 启动worker进程，指定scheduler的地址是203，端口是8786，代码中提交的端口也是8786,并挂后台 nohup dask-worker --name work-01 172.16.36.20:8786 & # 关闭防火墙就可以通过8787端口查看集群状态 sudo systemctl status firewalld # 查看防火墙状态，加d是服务 sudo systemctl stop firewalld # 关闭防火墙 http://172.16.36.30:8787/status

Setup

Dask有两种task scheduler

Single machine scheduler：

默认 scheduler，不用设置调用compute()方法，使用默认scheduler示例 import dask.dataframe as dd df = dd.read_csv(...) df.x.sum().compute() # This uses the single-machine scheduler by default

Distributed scheduler

需要设置一个Client示例from dask.distributed import Client client = Client(...) # Connect to distributed cluster and override default df.x.sum().compute() # This now runs on the distributed system

use case

分为两类：

Collection example：单机处理Large Numpy/Pandas/list，类似于spark。目前80+%的Dask用户是使用这种类型。Custom example：自定义任务调度（Custom task scheduler），类似于Luigi，Airflow，Celery或Makefiles

最新回复(0)