1、Jupyter Notebook introduction, installation, and usage tutorial, including switching environment kernels and adding extensions:
Reference blog: Jupyter Notebook introduction, installation, and usage tutorial: https://www.jianshu.com/p/91365f343585
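To switch a notebook between conda environments, each environment is typically registered as its own kernel via ipykernel. A minimal sketch (the environment name py36 is only a placeholder, not from the original):

# run inside the target conda environment; "py36" is an example name
pip install ipykernel
python -m ipykernel install --user --name py36 --display-name "Python (py36)"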
2、Installing the jupyter_contrib_nbextensions extension via pip. In my case the conda install failed, so I used pip instead; alternatively, you can download the package and install it with $ python ./setup.py install.
Reference blog: fix for the Nbextensions tab not showing after installing jupyter_contrib_nbextensions (with notes on common extensions): https://blog.csdn.net/u011318077/article/details/85475622
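The usual pip sequence for this extension, as a sketch (the --user flags assume a per-user install; the configurator package is optional but is what renders the Nbextensions tab):

pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
# optional: adds the Nbextensions tab to the notebook UI
pip install jupyter_nbextensions_configurator
jupyter nbextensions_configurator enable --user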
1、Using Jupyter with Spark:
Reference blog: configuring a Spark development environment for Jupyter: https://blog.csdn.net/u012948976/article/details/52372644
Reference: Apache Toree quick start on the official Apache site: https://toree.apache.org/docs/current/user/quick-start/
Note: newer Toree releases no longer support a PySpark kernel; if you need to connect to PySpark, see section 2 below.
1】Install Toree: pip install toree
2】Run: jupyter toree install --spark_home=/home/app/spark (your Spark installation path)
3】Start Jupyter Notebook: anaconda3_home/bin/jupyter notebook --allow-root
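As a quick sanity check, you can list the installed kernelspecs and look for a Toree entry (the path shown is only illustrative):

jupyter kernelspec list
# the output should include an apache_toree_scala entry, e.g.:
#   apache_toree_scala    /usr/local/share/jupyter/kernels/apache_toree_scala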
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

// connect to the standalone cluster master and run a trivial smoke test
val conf = new SparkConf().setMaster("spark://172.0.0.1:7077").setAppName("anchor_usr_gold_info")
val sc = new SparkContext(conf)
val name = "word"
println(name)

2、Using Jupyter with PySpark:
Reference blog: Apache Spark Tutorial: ML with PySpark: https://www.datacamp.com/community/tutorials/apache-spark-tutorial-machine-learning
1】Install pyspark: pip install pyspark (note: the pyspark package version must match your Spark cluster version; a version-check sketch follows this item)
pip install SomePackage          # latest version
pip install SomePackage==1.0.4   # a specific version
pip install 'SomePackage>=1.0.4' # a minimum version
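To verify the version match mentioned above, a minimal sketch: compare the pip-installed package version against the version the cluster reports.

import pyspark
print(pyspark.__version__)  # version of the pip-installed pyspark package
# once a SparkContext `sc` exists (see step 3】 below):
# print(sc.version)         # Spark version reported by the cluster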
2】Create a notebook file using the python3 kernel.
3】Import the relevant pyspark packages, and you can start writing PySpark programs:
# coding: utf-8
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark import StorageLevel
from pyspark.sql import Window
import numpy as np
import pandas as pd
from datetime import date
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

# point the driver at the standalone cluster master and size its resources
conf = (SparkConf()
        .setAppName("adCTR")
        .setMaster("spark://172.0.0.1:7077")
        .set("spark.driver.memory", "12G")
        .set("spark.driver.maxResultSize", "8G")
        .set("spark.executor.cores", "4"))
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

# read from the local filesystem:
# train = spark.read.csv("file:///home/jupyterFile/huaweiCTR/data/original_data/train_20190518.csv", inferSchema='true')
# read from HDFS:
train = spark.read.csv("hdfs:///huaweiCTR/original_data/train_20190518.csv", inferSchema='true')

Note: if you hit a FileNotFoundError such as "No such file or directory: '/User/YourName/Downloads/spark-2.1.0-bin-hadoop2.7/./bin/spark-submit'", you must (re)set your Spark PATH. Go to your home directory ($ cd), edit the .bash_profile file (e.g. /root/.bash_profile; the exact path differs per user), add the lines below, and then source the file to apply them:
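A minimal sketch of the lines to add, assuming Spark is unpacked at the path from the error message above (set SPARK_HOME to your actual installation directory):

# in ~/.bash_profile; the SPARK_HOME path is an example
export SPARK_HOME=/User/YourName/Downloads/spark-2.1.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH

Then apply the changes:

source ~/.bash_profile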