Getting Started Using Python with Vantage
Teradata Vantage™ supports the Python language both on the client side and in the database.
The client-side and in-database options for Python enable users to perform analytics with no SQL coding required. They provide functions for data manipulation and transformation, data filtering and sub-setting, and can be used in conjunction with open-source libraries.
The Teradata Package for Python, teradataml, combines access to powerful Vantage analytic functions with data manipulation functions, providing extensive in-database analytic capabilities from the client. It pairs the benefits of the open-source language environment with the massively parallel processing capabilities of Vantage.
teradataml processes data directly in Vantage, without bringing the data onto the client. As a result, teradataml can process data volumes that are not limited by the memory and resources available on the client machine. It also eliminates the need to transfer data between the client and Vantage, which saves time.
The language interpreters are installed on a client machine that connects to Vantage.
The teradataml package uses SQLAlchemy and provides functions that are similar to pandas.
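As an illustrative sketch of this pandas-like interface (the connection values and table name below are placeholders for your system):

```python
# Sketch only: replace the placeholder connection values with your system's.
from teradataml import create_context, DataFrame

# create_context() builds a SQLAlchemy-based connection to Vantage.
create_context(host="<host>", username="<username>", password="<password>")

# A teradataml DataFrame references a table in Vantage; operations on it
# are translated to SQL and run in-database rather than on the client.
df = DataFrame("mytable")
print(df.head(5))  # brings only a small sample back to the client
```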
The Teradata distribution packages support in-database Python script execution with the SCRIPT Table Operator (STO). The distribution packages offer updated and self-contained versions of the following two RPM packages:
- teradata-python package: Python interpreter for in-database language execution
- teradata-python-addons package: a collection of select popular add-on libraries
The language interpreters and corresponding Python add-on libraries RPM packages are installed on each Advanced SQL Engine node.
The Python interpreter and add-ons packages bundle supports native, in-database execution of Python scripts on the Advanced SQL Engine using the SCRIPT Table Operator.
Installing Jupyter Notebook
The use case examples in this guide use Jupyter Notebook.
If you do not have Jupyter Notebook installed on your computer, follow the instructions on https://jupyter.org/install.html to install it.
Installing Teradata Package for Python
Teradata Package for Python System Requirements
Required Software on Teradata Vantage
Depending on the connection to Vantage, the teradataml package requires the following minimum software version be installed on Vantage:
Teradata Vantage with Advanced SQL Engine Release 16.20.16.01 or later
Required Software on Client
The teradataml package also requires the following minimum software versions be installed on the client:
- teradatasqlalchemy 17.00.00.00
- teradatasql 17.00.00.02
- pandas 0.22.00
- Operating Systems:
- Windows OS: Windows 7
- macOS: OS X 10.9
- Linux: Ubuntu 16.04, CentOS 7.0, RHEL 7.1, or SLES 12
Installing Teradata Package for Python
Python must be installed on the client system before you can install the teradataml package.
The minimum version supported is Python 3.5.0.
Use the following command to install the teradataml package:
pip install teradataml
teradataml has a dependency on teradatasqlalchemy and teradatasql. Executing this command will also download and install sqlalchemy, teradatasqlalchemy, and teradatasql, if they are not yet installed.
teradatasqlalchemy and teradatasql can also be installed separately from https://pypi.org/user/teradata/.
Installing Python Interpreter and Add-ons Packages Bundle
Installation of the Teradata In-Database Python packages (Interpreter and Add-ons RPM files) is an elaborate process. The target Advanced SQL Engine needs to be checked in advance for adequate resources, and some related system parameters may need to be updated accordingly before the RPM files can be installed.
For this reason, Teradata strongly recommends enlisting Teradata Customer Services to install the Python language interpreters and corresponding Python add-on libraries.
Please contact your Teradata account representative to order this service.
Permissions Required to Use Script Table Operator
To use the Script Table Operator, certain permissions need to be granted to the users by their Database Administrator, as follows:
Permissions to run the Script Table Operator:
- GRANT EXECUTE FUNCTION ON td_sysfnlib.script TO <username>;
- GRANT CREATE EXTERNAL PROCEDURE ON <database> TO <username>;
- GRANT DROP PROCEDURE ON <database> TO <username>;
- GRANT SELECT ON <database> TO <username>;
- GRANT EXECUTE ON SYSUIF.DEFAULT_AUTH TO <username>;
Permissions to run install_file and remove_file:
- GRANT EXECUTE PROCEDURE ON SYSUIF.INSTALL_FILE TO <username>;
- GRANT EXECUTE PROCEDURE ON SYSUIF.REPLACE_FILE TO <username>;
- GRANT EXECUTE PROCEDURE ON SYSUIF.REMOVE_FILE TO <username>;
Additional permissions needed to run the SCRIPT Table Operator via teradataml, if the user will be working through a <database> other than the <username> database:
- GRANT EXECUTE FUNCTION ON td_sysfnlib.script TO <database> WITH GRANT OPTION;
- GRANT EXECUTE ON SYSUIF.DEFAULT_AUTH TO <database> WITH GRANT OPTION;
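Because these GRANT statements follow a fixed template, a Database Administrator can script them. The following helper is a hypothetical illustration (not part of teradataml or Vantage) that generates the per-user statements listed above:

```python
# Hypothetical helper: generates the SCRIPT Table Operator GRANT
# statements listed above for a given user and database.
def sto_grants(username, database):
    return [
        f"GRANT EXECUTE FUNCTION ON td_sysfnlib.script TO {username};",
        f"GRANT CREATE EXTERNAL PROCEDURE ON {database} TO {username};",
        f"GRANT DROP PROCEDURE ON {database} TO {username};",
        f"GRANT SELECT ON {database} TO {username};",
        f"GRANT EXECUTE ON SYSUIF.DEFAULT_AUTH TO {username};",
        f"GRANT EXECUTE PROCEDURE ON SYSUIF.INSTALL_FILE TO {username};",
        f"GRANT EXECUTE PROCEDURE ON SYSUIF.REPLACE_FILE TO {username};",
        f"GRANT EXECUTE PROCEDURE ON SYSUIF.REMOVE_FILE TO {username};",
    ]

# Print the statements for a hypothetical user and database.
print("\n".join(sto_grants("analyst1", "mydb")))
```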
Use Cases for Python
Getting Started: Uploading and Analysis
This use case shows how to use the teradataml package installed on the client to prepare data and run analytics.
The dataset used in this use case is included in the teradataml package and is uploaded to Vantage using the teradataml load_example_data() function.
In the attached zip file, the Jupyter notebook for this use case includes the following sections:
- Section 0: Loading teradataml modules and connecting to Vantage from the client
- Section 1: Uploading the example dataset included in the teradataml package to Vantage and preparing data for analysis
- Sections 2 and 3: Running Churn Analysis using Sessionize and NPath functions from the teradataml package
- Section 2 focuses on non-churn customers
- Section 3 focuses on churn customers
Before running the example Jupyter notebook GettingStarted.ipynb for this use case, which is included in the attached zip file DataScience-Python_UseCases.zip, replace the <host>, <username>, <password> and <database> with the actual host, username, password and database for your system.
Uploading Data using load_example_data() Function
The dataset used in this example is included in the teradataml package.
The teradataml package offers various functions, and each function provides examples. Use the load_example_data() helper function to upload the sample datasets to Vantage before testing these examples.
For this example, use the load_example_data() function to upload the sample dataset to Vantage and create a table named “retail_churn_table”.
The load_example_data() function can only be used in a restricted way. See Teradata Package for Python User Guide, B700-4006 for details on restrictions and usage of the load_example_data() function.
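As a hedged sketch of that upload step — the first argument (the example group, "sessionize" here) is an assumption for illustration, and the exact arguments depend on your teradataml version, so check the user guide cited above:

```python
# Sketch only: the example-group argument is an assumption here; check the
# Teradata Package for Python User Guide for the exact call for your version.
from teradataml import create_context, load_example_data, DataFrame

create_context(host="<host>", username="<username>", password="<password>")

# Uploads the packaged sample dataset and creates the table in Vantage.
load_example_data("sessionize", "retail_churn_table")

churn_data = DataFrame("retail_churn_table")
```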
Running Churn Analysis
After Uploading Data using load_example_data() Function and preparing the data for analysis, you can perform churn analysis for both churn and non-churn customers. Two teradataml analytic functions are used:
- The Sessionize function maps each click in a session to a unique session identifier.
- The NPath function performs pathing (pattern matching over the ordered clickstream) on the output of Sessionize.
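To make the Sessionize step concrete, the following pure-Python sketch (illustrating the underlying idea, not the teradataml API) groups a user's clicks into a new session whenever the gap between consecutive events exceeds a timeout:

```python
# Illustrative sketch of the sessionization logic only; in the use case
# this runs in-database via the teradataml Sessionize function.
def sessionize(timestamps, timeout):
    """Assign a session ID to each event time (sorted, in seconds)."""
    session_ids, current = [], 0
    for i, t in enumerate(timestamps):
        if i > 0 and t - timestamps[i - 1] > timeout:
            current += 1  # gap exceeded the timeout: start a new session
        session_ids.append(current)
    return session_ids

clicks = [0, 30, 70, 500, 520, 2000]    # event times in seconds
print(sessionize(clicks, timeout=300))  # → [0, 0, 0, 1, 1, 2]
```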
Analyzing with In-Database Python and STO
This use case shows how to use teradataml functions to prepare data for analysis, and then run analytics with In-DB Python interpreters and add-on packages through the SCRIPT Table Operator (STO).
The datasets used in this use case are CSV files that you upload to Vantage using Teradata Studio.
Before running the example Jupyter notebook AnalyzingSTO.ipynb for this use case, extract the attached zip file DataScience-Python_UseCases.zip and make the following changes:
- In the Jupyter notebook AnalyzingSTO.ipynb:
- Replace the <host>, <username>, <password> and <database> with the actual host, username, password and database for your system;
- Replace the <files_local_path> with the actual local path where you saved the Python script file.
- In the Python script file ex1pSco.py, replace the <database> with the actual database for your system.
Uploading Data Files using Teradata Studio
The data files used in this example are three CSV files included in the attached zip file DataScience-Python_Datafiles.zip.
Follow the steps below to upload the files to Vantage using Teradata Studio.
See Teradata Studio™ User Guide, B035-2041 for details on how to use Teradata Studio.
Repeat these steps for all three data files in the attachment:
- Launch the software and establish a connection to the target Vantage system.
- Navigate to the database where you want to create the three tables that are used to store the CSV data.
- Click the database name to expand the list of objects in it.
- Right click Tables in the desired target database and select Teradata > Data Transfer.
- In the Data Transfer Wizard window:
- Select the “External File” source option, which uses the Smart Load utility.
- Click the “Launch” button.
- In the Import Data window:
- Ensure the “Column Labels in First Row” option is checked.
- Specify “Comma” under File Options > Column Delimiter.
- Specify “None” under File Options > Character String Delimiter.
- Specify the line separator that matches your operating system (for example, “Windows OS”) under File Options > Line Separator.
- Click the “Finish” button at the bottom.
- The Table Column Data Types window displays a sample of the data and the column types Teradata Studio plans to use when importing the data into the database. Use the information in the attached “Table DDL details for the data” file to make any needed changes in this window.
- The SQL Preview window displays a preview of the table DDL based on your preceding input. Ensure this preview matches the specifications provided in the attached “Table DDL details for the data” file.
- Click “Finish”.
- The data upload begins automatically. Wait until the upload is complete.
- After the upload, verify in Teradata Studio that the tables are present in the desired destination locations.
Preparing Data for Analysis
After Uploading Data Files using Teradata Studio, you will use teradataml functions to perform a series of operations on the data, to produce an Analytic Data Set (ADS) with the features needed for the analysis.
Scoring using Script Table Operator
After Uploading Data Files using Teradata Studio and then Preparing Data for Analysis, you can perform scoring on the test dataset by in-database execution of a Python script on the Advanced SQL Engine using Script Table Operator.
The in-database execution of a Python script is supported by the Python interpreter and add-ons packages installed in the Advanced SQL Engine.
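As a hedged sketch of what this step can look like through teradataml's Script wrapper — the parameter names reflect recent teradataml releases and should be verified against your installed version, and the table, column names, and types are assumptions for illustration:

```python
# Hedged sketch: table, columns, types, and paths are placeholders.
from collections import OrderedDict
from teradataml import create_context, DataFrame, Script
from teradatasqlalchemy.types import VARCHAR, FLOAT

create_context(host="<host>", username="<username>", password="<password>")

sto = Script(
    data=DataFrame("test_ads"),             # the prepared test ADS (assumed name)
    script_name="ex1pSco.py",
    files_local_path="<files_local_path>",
    script_command="python3 ./<database>/ex1pSco.py",
    returns=OrderedDict([("cust_id", VARCHAR(20)), ("score", FLOAT())]),
)

# Install the script on the Advanced SQL Engine, then run it in-database
# through the SCRIPT Table Operator.
sto.install_file(file_identifier="ex1pSco", file_name="ex1pSco.py")
scores = sto.execute_script()
```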