Stock Market Data Prediction and Analysis

Zacary Fettig
IT
  • Pine Cove Consulting MSP
  • Azure Solutions Architect Expert | MCSE: Core Infrastructure


Summary - Example AWS pipeline for predicting stock market data. Python code leveraging the Yahoo Finance API streams stock market ticker data into AWS, where it flows into Amazon SageMaker, an AI machine learning platform that builds, trains, and deploys machine learning models. Data can be streamed in and processed to predict time series data.

 

Project Resource Use Cases:

Python code leveraging the Yahoo Finance API: Python code streaming real-time stock market data into AWS Kinesis (a minimal sketch of this code appears after this list).

Amazon Elastic Container Service (ECS): Container service running the Python Stock Market Streaming Code.

Amazon Kinesis: Messaging queue service for real-time streaming data.

Amazon Kinesis Firehose: Service that takes the streaming data from Kinesis and delivers it to a destination in real time. In this project, Firehose is transporting data to Redshift.

Amazon Redshift: Data warehouse chosen for its scalability in data processing. In this project, it is used to store all the collected streaming data in one place before distributing it to other data analysis resources.

Amazon SageMaker AI: An AWS Service that provides a single interface for building, training, and visualizing machine learning models.

AWS Glue: A service that prepares, transforms, and moves data. Used in this project to make data transformations that separate the tickers into individual data sets for consumption by SageMaker.

Note: AWS Glue supports scheduled runs as frequently as every 5 minutes for automating the flow of data. For real-time processing, AWS Lambda can be used as an alternative to Glue.
The same process (exporting data from Redshift, transforming it by separating tickers into separate data sets, and outputting to S3 for SageMaker consumption) can be implemented using Lambda for real-time data pipelines.
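
Before the AWS setup steps, the sketch below illustrates what the streaming application could look like. It is a minimal, hypothetical version (not the exact code in the published container image), assuming the yfinance package and the STREAM_NAME, REGION_NAME, and TICKERS environment variables described later in the task definition section. It pulls the latest 1-minute bar for each ticker and writes a CSV record to the Kinesis stream.

import os
import time

import boto3
import yfinance as yf

stream_name = os.environ["STREAM_NAME"]
region_name = os.environ["REGION_NAME"]
tickers = os.environ["TICKERS"].split(",")  # e.g. "AAPL,MSFT,GOOG"

kinesis = boto3.client("kinesis", region_name=region_name)

while True:
    for ticker in tickers:
        # Pull the most recent 1-minute bar and use its close as the current price.
        history = yf.Ticker(ticker).history(period="1d", interval="1m")
        if history.empty:
            continue
        latest = history.iloc[-1]
        # CSV record in the same column order as the Redshift table: ticker, date, price.
        record = f"{ticker},{latest.name.strftime('%Y-%m-%d %H:%M:%S')},{float(latest['Close'])}\n"
        # PartitionKey keeps records for the same ticker on the same shard.
        kinesis.put_record(StreamName=stream_name, Data=record.encode("utf-8"), PartitionKey=ticker)
    time.sleep(60)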

Streaming stock market data to AWS Kinesis

1. Create a new Kinesis Data Stream.

2. Name the data stream and select on-demand. Provisioned can be selected instead if the capacity of the data stream is known. Select create data stream at the bottom of the page. (A boto3 sketch of this step follows below.)
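
For reference, a rough boto3 equivalent of creating the on-demand stream; the stream name and region here are placeholders.

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # assumed region
kinesis.create_stream(
    StreamName="stock-market-stream",  # assumed stream name
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)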

Creating ECS Container for hosting Python Code

Create new cluster

1. Select create new ECS Cluster on the ECS Cluster Resource Creation Page.

2. Name the cluster. Check AWS Fargate (serverless). Select create at the bottom of the page.

Create an IAM role to be used as the ECS task role in the task definition

1. Open the IAM (Identity and Access Management) console to create a new policy that will be used in the role for the ECS task role connecting to the Kinesis stream.

2. Under the select a service section, select Kinesis. New options to configure will appear. Under Write, check PutRecord and PutRecords. Select Add ARNs to add the specific ARN of the Kinesis stream created in the previous section.

3. In the Specify ARNs pop-up, select This account. Enter the Kinesis stream ARN; the region and stream name fields will auto-fill. Select Add ARNs.

4. Select next at bottom of page.

5. Name the policy ECSKinesisWrite. Select create policy at the bottom of the page.

6. In the roles section of IAM, select create role. This role will later be assigned in the ECS task role section.

7. Select AWS Service under trusted entity type. Select Elastic Container Service under the use case heading. Select Elastic Container Service Task from the detailed options. Select next.

8. Search for the previously created policy, ECSKinesisWrite. Check the box to add the policy to the role and select next.

9. Enter a name for the role and select create role. (A boto3 sketch of this policy and role follows below.)
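
The same policy and role could also be created with boto3, roughly as below; the account ID, stream ARN, and role name are placeholders for the values used in the console steps.

import json

import boto3

iam = boto3.client("iam")

# Policy allowing writes to the specific Kinesis stream (placeholder ARN).
policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["kinesis:PutRecord", "kinesis:PutRecords"],
        "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/stock-market-stream",
    }],
}
policy = iam.create_policy(PolicyName="ECSKinesisWrite", PolicyDocument=json.dumps(policy_document))

# Trust policy so that ECS tasks can assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ecs-tasks.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
iam.create_role(RoleName="ECSKinesisTaskRole", AssumeRolePolicyDocument=json.dumps(trust_policy))
iam.attach_role_policy(RoleName="ECSKinesisTaskRole", PolicyArn=policy["Policy"]["Arn"])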

Create task definition

1. Select create new task definition

2. Create a name for the task definition. Select AWS Fargate as the launch type. Select the OS architecture as Linux/x86_64. Specify the base CPU and memory resources.

3. Select the previously created role for the task role and select create new role for the task execution role.

4. Name the container finance-docker. Enter the Docker image repository below into the Image URI box. The container image contains the stock market streaming application. No port mappings are needed for the application, so remove the container port mapping.

Docker Image Repository: zacaryfettig/yahoofinancedocker:v1

5. Add three environment variables: STREAM_NAME = your stream name, REGION_NAME = your region, and TICKERS = ticker1,ticker2,ticker3,ticker4,etc. The stream name is the Kinesis Data Stream name created in the earlier Kinesis instructions section.

6. Leave the rest of the settings the same and select create. (A boto3 sketch of an equivalent task definition follows below.)
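
A rough boto3 equivalent of registering this task definition; the role ARNs, CPU/memory sizes, region, and ticker list are assumptions.

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")
ecs.register_task_definition(
    family="finance-streaming-task",  # assumed family name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    runtimePlatform={"cpuArchitecture": "X86_64", "operatingSystemFamily": "LINUX"},
    taskRoleArn="arn:aws:iam::123456789012:role/ECSKinesisTaskRole",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    containerDefinitions=[{
        "name": "finance-docker",
        "image": "zacaryfettig/yahoofinancedocker:v1",
        "essential": True,
        "environment": [
            {"name": "STREAM_NAME", "value": "stock-market-stream"},
            {"name": "REGION_NAME", "value": "us-east-1"},
            {"name": "TICKERS", "value": "AAPL,MSFT,GOOG,AMZN"},
        ],
    }],
)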

Run a new ECS Task to run the Docker Image

1. On the clusters page, select the cluster. On the config page select the tasks tab. Select run a new task.

2. Keep capacity provider strategy selected.

3. Select task definition from the task definition family section. Select create at bottom of page.

4. The task starts the Docker container, which will stream the stock market data to Kinesis. (A boto3 sketch of running the task follows below.)
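
Running the task could also be done with boto3, roughly as below; the cluster name, subnets, and security group are placeholders. The task needs outbound internet access to reach Yahoo Finance, hence the public IP assignment in this sketch.

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")
ecs.run_task(
    cluster="finance-streaming-cluster",      # assumed cluster name
    taskDefinition="finance-streaming-task",  # assumed task definition family
    launchType="FARGATE",
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "ENABLED",
        }
    },
)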

Creating Redshift Instance

1. Create a Redshift Serverless database using the customized settings option. Create a target namespace name. A namespace is a collection of database objects and settings.

2. Select Create IAM Role on the Redshift Configuration Page.

3. Open a new AWS browser tab and go to the Amazon S3 resource creation page. Create an S3 bucket to be used in the next steps. The S3 bucket will be used for transferring data stream messages to the database, as detailed in the upcoming Kinesis Firehose instructions. As an example, I have named the bucket firehose-temporary-storage. Name the bucket and leave everything else as default. Select create bucket.

4. In the create IAM role window, select Specific S3 buckets and choose the S3 bucket just created. Click create IAM role as default.

5. Select export these logs and check all.

6. A workgroup defines compute resources. See the link below to learn more about setting the base RPU capacity: https://docs.aws.amazon.com/redshift/latest/mgmt/serverless-capacity.html

7. Keep the rest of the settings the same and select save configuration at the bottom. The wizard will finish creating the resources and Redshift will be ready to use. (A boto3 sketch of creating the namespace and workgroup follows below.)
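
For reference, a rough boto3 equivalent of creating the serverless namespace and workgroup; the names and base capacity here are assumptions.

import boto3

redshift_serverless = boto3.client("redshift-serverless", region_name="us-east-1")

# The namespace holds the database objects and settings.
redshift_serverless.create_namespace(namespaceName="stock-namespace")

# The workgroup provides the compute; baseCapacity is in Redshift Processing Units (RPUs).
redshift_serverless.create_workgroup(
    workgroupName="stock-workgroup",
    namespaceName="stock-namespace",
    baseCapacity=8,
)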

Creating Redshift Database Schema, Table, and Columns

Open Query editor

1. On the main Redshift Resource Page, select Query data from the upper right corner. This will bring up the query editor where internal database configurations can be made.

2. Expand the folders in the left-side folder structure to view the existing databases.

Create new database

1. Select the create button and select create a new database. Enter a new name for the database that will store the streaming stock data. Select create database.

Create new schema

1. A schema is a blueprint for how the database is structured. Select schema from the create menu. On the create schema page, select the database for the schema to be created in, enter a schema name, and select create schema.

Create new table

1. With the new database selected in the left menu, select table from the create menu.

2. Select the database, schema, and the name for the new table at the top. Next, enter the column data. It's important that the columns are entered exactly as the Python streaming application writes them (ticker, date, price) so that they match the streaming code. Encoding is set to no selection for all three columns. (An illustrative CREATE TABLE statement follows after these steps.)

3. Under the resources, the new database, schema, and table will show up as created.
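
As an illustration, DDL along the lines of the following would create the schema and table; the schema and table names are placeholders and the exact datatypes used in the original setup may differ, so treat these as assumptions.

-- Column names match the streaming records: ticker, date, price.
CREATE SCHEMA stock_schema;

CREATE TABLE stock_schema.stock_prices (
    ticker VARCHAR(10),
    "date" TIMESTAMP,
    price  DOUBLE PRECISION
);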

Creating the Firehose Stream Connection to Redshift

1. On the Firehose creation screen, select Amazon Kinesis Data Streams as the source and Amazon Redshift as the destination. Choose a name for the Firehose stream.

2. Select the previously created data stream. Choose Serverless workgroup (Redshift Serverless) and select the workgroup from the Redshift creation process.

3. Enter the name of the Redshift database created in the Redshift database/schema/tables section.

4. Enter the database table name created in the Redshift database/schema/tables instructions section. The columns are the following: ticker,date,price. Enter them in the same order. Select the previously created S3 bucket in the intermediate S3 bucket area; this was the bucket created when creating the Redshift workgroup. For the COPY command to recognize CSV format, the copy command options are as follows: FORMAT AS CSV DELIMITER ',' QUOTE '"'. Keep the rest of the settings the same and select create Firehose stream.

5. Once created, the main Firehose dashboard page has monitoring and logs sections to monitor data processing. (An illustrative COPY command follows below.)
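
To sanity-check the copy options, a COPY command you could run manually against a file in the intermediate bucket would look roughly like the following; the object key and IAM role ARN are placeholders.

COPY stock_schema.stock_prices (ticker, "date", price)
FROM 's3://firehose-temporary-storage/<object-key>'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read-role'
FORMAT AS CSV DELIMITER ',' QUOTE '"';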

AWS Glue Data Source: Redshift Connector

Create IAM Role for Redshift Glue Connector

1. Open IAM Console, select the roles tab, create new role.

2. Choose Glue as the service in the use case section. This sets up the role to be assumed by the Glue service.

3. Search for and select the permissions AmazonRedshiftFullAccess, AmazonS3FullAccess, AWSGlueConsoleFullAccess, AWSGlueServiceRole, SecretsManagerReadWrite, and AWSKeyManagementServicePowerUser. Once finished, select next.

4. Name the role GlueRedshiftConnectorRole. Select create role at bottom of page. 

Creating VPC Endpoints

1. The next step is setting up VPC endpoints so that each of the resources can talk to AWS Glue across the VPC. Five endpoints will be created. In the VPC resource configuration page, select Endpoints from the left side menu. Select create endpoint.

2. Name the first endpoint s3-endpoint. From the services section, search for and select Service Name = com.amazonaws.<your-region>.s3

3. Under the services section, select the service with type Gateway. Select the VPC where all the AWS resources for the project are located. Select the route table under the route table section. Leave the rest as default and select create endpoint at the bottom of the page.

4. Create another new endpoint, named RedshiftEndpoint. Under service, select Service Name = com.amazonaws.<your-region>.redshift. Select the VPC. Check each subnet and select each subnet under the Subnet ID column. Select the same security group that is associated with Redshift. Select create endpoint at the bottom.

5. Create another endpoint named SecretsManagerEndpoint. Under service, select Service Name = com.amazonaws.<your-region>.secretsmanager. Select the VPC. Check each subnet and select each subnet under the Subnet ID column. Select the same security group that is associated with Redshift. Select create endpoint at the bottom of the page.

6. Create another endpoint named KmsEndpoint. Under service, select Service Name = com.amazonaws.<your-region>.kms. Select the VPC. Check each subnet and select each subnet under the Subnet ID column. Select the same security group that is associated with Redshift. Select create endpoint at the bottom of the page.

7. For the last endpoint, name it StsEndpoint. Under service, select Service Name = com.amazonaws.<your-region>.sts. Select the VPC. Check each subnet and select each subnet under the Subnet ID column. Select the same security group that is associated with Redshift. Select create endpoint at the bottom of the page.

8. All five endpoints are now listed. (A boto3 sketch of creating one of the interface endpoints follows below.)
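
For reference, one of the interface endpoints could be created with boto3 roughly as below; the VPC, subnet, and security group IDs are placeholders. The S3 endpoint differs: it uses VpcEndpointType="Gateway" and RouteTableIds instead of subnets and security groups.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.redshift",
    SubnetIds=["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
)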

AWS Glue Redshift Connection

1. In AWS Glue, select the data connections tab on the left menu. On the Connectors page, select create connection in the connections section towards the bottom.

2. Choose Amazon Redshift as the data source.

3. Select the Redshift workgroup and the name of the database where the data is stored. For credentials, either enter the credentials and Glue will create a secret for you in Secrets Manager, or select an existing secret. Select the previously created GlueRedshiftConnectorRole.

4. Select Next. Name the connection and create the connection.

5. Once creation is complete the new connection will show up under connections.

Creating AWS Glue Pipeline

Opening visual editor and creating Glue Data Flow Source

1. In AWS Glue select the Visual ETL button on the Visual ETL page tab to enter the editor where the Glue Data Flow will be defined.

2. Select Amazon Redshift as a source. In the right side properties menu, leave direct data connection selected. Select the Glue Redshift Database Connection created from the previous section. Select the schema and table that was defined in Redshift that includes the stock ticker data.

 

Transforming the data so that each stock ticker is separated out into its own data set and outputting the transformed data to an S3 bucket. The S3 data can then be imported into SageMaker in the next sections.

1. We will choose to partition the data in the last step when specifying the data destination. Partitioning the data will separate the tickers into separate sets; unfortunately, this removes the partition column from the resulting output. To resolve this, add a duplicate column and use it as the partition column so that all of the original columns still show up in the S3 data. To make this change, add a new node called SQL Query and select the Amazon Redshift node as the node parent to link the two. Enter the following SQL query:

SELECT *, ticker AS ticker_copy FROM myDataSource

2. Create a new S3 bucket that the newly formatted data will be stored in. The resulting output streaming data will be ready to be imported into SageMaker after the process in these steps is complete. The bucket will be selected as the destination for the data in the next step.

3. Add a new Amazon S3 target node. Select the SQL Query node as the node parent. Select CSV as the output file format, with no compression. Set the destination to the destination S3 bucket created in the previous step.

4. Select add partition key and select ticker_copy as the column. Name and save the data flow. Select the run button to run the job.

5. The job will run and, once it completes, the data will be output to the S3 bucket. Ticker data is separated out into folders. When importing data into SageMaker, import the folder labelled with the stock ticker that will be used in the machine learning model. All of that ticker's data is stored within the files in that folder. (A sketch of an equivalent Glue job script follows below.)
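
The visual ETL editor generates a job script behind the scenes. A hand-written script implementing the same flow could look roughly like the sketch below; the connection name, schema/table, temporary directory, and output bucket are all assumptions.

import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the stock table from Redshift through the Glue connection (assumed names).
source = glue_context.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options={
        "useConnectionProperties": "true",
        "connectionName": "redshift-glue-connection",
        "dbtable": "stock_schema.stock_prices",
        "redshiftTmpDir": "s3://glue-temp-bucket/redshift-tmp/",
    },
)

# Duplicate the ticker column so it survives partitioning.
source.toDF().createOrReplaceTempView("myDataSource")
transformed = spark.sql("SELECT *, ticker AS ticker_copy FROM myDataSource")

# Write one folder per ticker to the destination bucket as CSV.
glue_context.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(transformed, glue_context, "transformed"),
    connection_type="s3",
    connection_options={"path": "s3://stock-ticker-output/", "partitionKeys": ["ticker_copy"]},
    format="csv",
)

job.commit()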

Automating Glue Data Flow

Scheduling the Glue job

1. Select the job in the list. Under actions menu on the right side, select schedule job.

2. Name the schedule. Select custom for the frequency. Enter a cron expression for how often the data should be pulled; for example, cron(0/5 * * * ? *) runs every 5 minutes, the shortest interval Glue supports. Select create schedule. (A boto3 sketch of an equivalent schedule follows below.)
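
A rough boto3 equivalent of this schedule, using a scheduled Glue trigger; the trigger and job names are placeholders.

import boto3

glue = boto3.client("glue", region_name="us-east-1")
glue.create_trigger(
    Name="stock-etl-every-5-minutes",
    Type="SCHEDULED",
    Schedule="cron(0/5 * * * ? *)",             # every 5 minutes
    Actions=[{"JobName": "stock-ticker-etl"}],  # assumed Glue job name
    StartOnCreation=True,
)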

Setting up SageMaker Studio

SageMaker Documentation still in progress

1. Create a new SageMaker domain. A domain is a way for notebook files, users, and data to be organized into groups. Within a domain, shared workspaces can be used. Multiple domains can be used when teams need to be completely separated for security, or when different teams working in different sectors shouldn't have access to the same resources.

2. Selecting set up for organizations will guide you through extra choices that can be configured during the setup wizard phase. Walk through the domain setup wizard based on your preferences.

3. Users can be created to access SageMaker apps under the User Profiles tab within the domain. User access is defined by the IAM role specified during setup of the user. Those apps can also be launched from the same page by selecting the launch button on the right side. Refer to the AWS documentation for more complex user setups with IAM Identity Center, or to third-party documentation for other identity providers. Spaces for collaboration and workspace sharing between users can be set up in SageMaker Studio.

Loading S3 Bucket into SageMaker Resources

SageMaker/Jupyter Notebook: boto3 Python SDK

A guide from the Saturn Cloud website explains very well how to load S3 data into a notebook: https://saturncloud.io/blog/loading-s3-data-into-your-aws-sagemaker-notebook-a-comprehensive-guide (a minimal example follows below).
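
As a minimal sketch, the snippet below pulls one ticker's partition from the Glue output bucket into a pandas DataFrame inside a notebook; the bucket name, partition prefix, and header handling are assumptions based on the Glue job above.

import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
bucket = "stock-ticker-output"  # assumed destination bucket from the Glue job
prefix = "ticker_copy=AAPL/"    # Glue writes one folder per partition value

frames = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        # Pass names=["ticker", "date", "price"] instead if the files have no header row.
        frames.append(pd.read_csv(io.BytesIO(body)))

data = pd.concat(frames, ignore_index=True)
print(data.head())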

AWS in R: R packages

Refer to the following link to learn more about accessing AWS Resources in R.

https://aws.amazon.com/blogs/opensource/getting-started-with-r-on-amazon-web-services/

SageMaker Canvas: Data Wrangler

Creating and Training Machine Learning Models

More content in this section to be updated in the future

© 2023 All Rights Reserved.
Email: zacaryfettig@gmail.com