Cloud for Data Analysts and Data Scientists

Introduction

In the past decade, the field of data analytics has been transformed by the rapid shift from traditional relational database servers (DB2, Oracle, SQL Server, etc.) to cloud computing. Cloud platforms such as Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP), and Alibaba Cloud provide scalable, flexible, and cost-effective solutions for storing, managing, processing, and analyzing data. For data analysts and data scientists, the cloud offers unprecedented on-demand access to computing power, storage, and collaborative environments. This paper introduces the applications of cloud technologies in data analytics, focusing on their impact on workflows, collaboration, and efficiency. It also demonstrates examples of SQL, SAS, and Python coding in a cloud context and lists free cloud resources for practice.


What Is Cloud Computing?

As a data analyst or data scientist, do you need to know the architecture of the cloud or exactly how it works? The answer is no. Briefly speaking, cloud computing is the delivery of computing services (servers, storage, databases, networking, software, and analytics) over the internet. Instead of managing physical infrastructure, organizations can rent resources on demand, paying only for what they use. This business model reduces upfront costs, shortens deployment times, and provides the flexibility needed to scale quickly as data volumes grow.

For data professionals, cloud computing enables:

  1. Elastic scalability: Easily expand or contract resources depending on workload needs.
  2. Global accessibility: Access datasets and analysis tools from any location.
  3. Integration with advanced tools: Seamlessly use machine learning frameworks, big data platforms, and visualization tools.
  4. Collaboration: Multiple stakeholders can interact with the same data resources simultaneously.

Relevance to Data Analysts and Data Scientists

Data Analysts

Data analysts use various statistical methods, including descriptive and diagnostic analytics, to solve business problems. For the past three or four decades, data for business analysis has been stored in relational databases. As data volumes explode and data types multiply, traditional databases are no longer enough. Cloud technologies provide analysts with:

  • Cloud-hosted databases (e.g., Amazon Redshift, Azure Synapse, Google BigQuery) to run SQL queries on massive datasets.
  • Visualization platforms (e.g., Tableau Cloud, Power BI Service, Looker) integrated directly with cloud storage.
  • Automated ETL tools that reduce the burden of data cleaning and transformation.

Data Scientists

Data scientists extend beyond analysis to predictive and prescriptive analytics, including machine learning, statistical modeling, and experimentation. The cloud provides:

  • High-performance compute clusters for training models on large structured or unstructured datasets.
  • Managed machine learning services like AWS SageMaker, Azure ML Studio, and Google Vertex AI.
  • Notebook-based collaboration using JupyterHub, Databricks, or SAS Viya in the cloud.
  • Integration with programming languages such as Python, R, and SAS.

Cloud Data Storage and Management

One of the most important applications of cloud technologies is the storage and management of data. Services like Amazon S3, Azure Blob Storage, and Google Cloud Storage act as data lakes where raw, semi-structured, and structured data can be stored at scale. These storage solutions integrate seamlessly with query engines such as Athena (AWS) and BigQuery (GCP), allowing data analysts to perform SQL queries directly on data without moving it. Data analysts and data scientists should know where to find their data, where to write queries, and how to construct them.
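On the AWS side of the same idea, the sketch below submits a query to Athena from Python with boto3, so the data never leaves S3. It is a minimal sketch, not a complete workflow; the bucket, database, table, and result location names are hypothetical placeholders.

# Minimal boto3 sketch: run an Athena query on data stored in S3 (all names are placeholders)
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit the query; Athena writes the results to the (hypothetical) S3 output location
response = athena.start_query_execution(
    QueryString="SELECT customer_id, SUM(order_amount) AS total_spent "
                "FROM orders GROUP BY customer_id ORDER BY total_spent DESC LIMIT 10",
    QueryExecutionContext={"Database": "ecommerce"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes
state = "RUNNING"
while state in ("QUEUED", "RUNNING"):
    time.sleep(1)
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]

# Fetch the first page of results (the first row contains the column headers)
if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print("Rows returned:", len(rows) - 1)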

Example: SQL in Google BigQuery

-- Example SQL query on a cloud-hosted dataset
SELECT
    customer_id,
    COUNT(DISTINCT order_id) AS total_orders,
    SUM(order_amount) AS total_spent
FROM
    `ecommerce.orders`
WHERE
    order_date BETWEEN '2025-04-01' AND '2025-06-30'
GROUP BY
    customer_id
ORDER BY
    total_spent DESC
LIMIT 10;

This query illustrates how a data analyst can generate insights about customer purchasing behavior across Q2 of 2025 using SQL in BigQuery on GCP. The syntax is similar to the T-SQL or PL/SQL taught in data classes.


Cloud-Based Data Transformation and ETL

For data engineers, Extract, Transform, Load (ETL) processes are essential for cleaning and preparing data for analysis. Cloud-native ETL tools, such as AWS Glue, Azure Data Factory, and Google Dataflow, automate these tasks while integrating with machine learning pipelines.
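As a complement to the SAS example below, here is a minimal sketch of the same kind of filter-and-write transformation written as an Apache Beam pipeline, the programming model behind Google Dataflow. The bucket paths and CSV layout are hypothetical assumptions, not taken from the examples in this paper.

# Minimal Apache Beam pipeline (the model behind Google Dataflow); paths are placeholders
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_order(line):
    # Assumes a CSV layout of order_id,customer_id,order_amount
    order_id, customer_id, order_amount = line.split(",")
    return {"order_id": order_id, "customer_id": customer_id, "order_amount": float(order_amount)}

# With no options this runs locally; add runner="DataflowRunner", project, region,
# and temp_location to submit the same pipeline to Google Dataflow
options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read raw orders" >> beam.io.ReadFromText("gs://my-bucket/raw/orders.csv", skip_header_lines=1)
        | "Parse CSV" >> beam.Map(parse_order)
        | "Keep valid orders" >> beam.Filter(lambda r: r["order_amount"] > 0)
        | "Format as text" >> beam.Map(str)
        | "Write cleaned data" >> beam.io.WriteToText("gs://my-bucket/cleaned/orders")
    )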

Example: SAS on the Cloud for Data Transformation

SAS, a traditional analytics platform, has expanded into the cloud through SAS Viya. Analysts and data scientists can use SAS to transform and prepare data stored in cloud environments:

/* Connecting to a cloud-hosted database and transforming data */
libname mydb odbc datasrc="AzureSQLDatabase" user="analyst" password="mypassword";

/* Data transformation */
data work.cleaned_orders;
    set mydb.orders;
    if not missing(order_id) and order_amount > 0;
    order_year = year(order_date);
run;

/* Summarize results */
proc sql;
    select order_year, count(*) as total_orders, avg(order_amount) as avg_spent
    from work.cleaned_orders
    group by order_year;
quit;

This SAS code demonstrates how cloud-hosted data can be transformed and analyzed seamlessly using familiar SAS syntax.


Cloud-Based Analytics and Machine Learning

Cloud platforms provide both infrastructure (compute clusters, GPUs, TPUs) and managed services to accelerate analytics and machine learning.

  • AWS SageMaker: End-to-end machine learning workflows from data preparation to deployment.
  • Azure Machine Learning: Collaborative model development with automated ML capabilities.
  • Google Vertex AI: Unified AI platform for model training, deployment, and monitoring.

These tools enable data scientists to move from exploration to production more quickly, often reducing the need for specialized infrastructure.
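As a sketch of what the managed-service route looks like on AWS (the Azure ML and Vertex AI SDKs follow a similar pattern), the snippet below hands a training job to SageMaker. The IAM role ARN, S3 path, and train.py entry script are hypothetical placeholders you would replace with your own resources.

# Hedged sketch: launch a managed scikit-learn training job with the SageMaker Python SDK
# The role ARN, S3 path, and train.py script are placeholders, not real resources
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical IAM role

estimator = SKLearn(
    entry_point="train.py",          # hypothetical script holding the scikit-learn training code
    framework_version="1.2-1",
    instance_type="ml.m5.large",
    instance_count=1,
    role=role,
    sagemaker_session=session,
)

# SageMaker provisions the compute, reads the training data from S3, and tears it all down afterwards
estimator.fit({"train": "s3://my-bucket/risk/train/"})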

Example: Training a Predictive Model in SQL (BigQuery ML)

-- Train a logistic regression model in BigQuery ML
CREATE OR REPLACE MODEL `retention_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
    age,
    account_length,
    num_complaints,
    retention AS label
FROM
    `banking.customer_data`;

In this example, a predictive model for customer retention is trained directly within the SQL interface of BigQuery ML—showcasing how cloud platforms blur the line between data analysis and machine learning.
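Once the model exists, it can be scored from Python as well. Below is a minimal sketch using the BigQuery client library and ML.PREDICT, BigQuery ML's standard scoring function; it assumes the same `banking.customer_data` table supplies the rows to score.

# Score new rows with the model trained above via ML.PREDICT
from google.cloud import bigquery

client = bigquery.Client()

predict_sql = """
SELECT predicted_label, predicted_label_probs
FROM ML.PREDICT(
    MODEL `retention_model`,
    (SELECT age, account_length, num_complaints
     FROM `banking.customer_data`))
"""

# Each output row carries the predicted retention class plus per-class probabilities
scores = client.query(predict_sql).to_dataframe()
print(scores.head())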


Collaboration and Workflow Optimization

Cloud technologies enhance collaboration among analysts, scientists, and business stakeholders. Notable benefits include:

  • Version control: Integration with GitHub or GitLab in cloud environments ensures reproducibility.
  • Shared workspaces: Tools like Databricks allow simultaneous collaboration on notebooks.
  • Scalable dashboards: Cloud-hosted dashboards can be updated in real time and accessed globally.

Example: Collaborative SAS Viya Environment

SAS Viya provides browser-based access to data and models hosted in the cloud. Analysts and data scientists can co-develop pipelines without worrying about local computing constraints. For example, one team member might run an exploratory SAS procedure while another builds predictive models, both referencing the same underlying cloud-hosted data.

Python Programming for Risk Data Analysis in the Cloud

Python is one of the most widely used languages for data science and data analysis, and it can also be used to conduct analysis and build AI models in the cloud. Cloud platforms provide scalable environments (Databricks, JupyterHub on GCP/AWS/Azure, or SAS Viya with Python integration) where Python code can be executed against large datasets. Risk analysis applications typically include credit scoring, fraud detection, and portfolio risk measurement.

Example 1: Credit Risk Scoring in the Cloud

# Example using scikit-learn for credit risk analysis
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from google.cloud import bigquery

# Load data from BigQuery (Google Cloud)
client = bigquery.Client()
query = """
SELECT age, income, credit_score, default_flag
FROM `finance_dataset.customer_risk`
"""
df = client.query(query).to_dataframe()

# Train-test split
X = df[['age', 'income', 'credit_score']]
y = df['default_flag']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions and evaluation
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

This workflow connects directly to Google BigQuery, retrieves customer credit data, trains a logistic regression model for predicting defaults, and evaluates results.


Example 2: Fraud Detection with Cloud Data

# Example: Anomaly detection for fraud risk using Isolation Forest
import io

import pandas as pd
from sklearn.ensemble import IsolationForest
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Connect to Azure Blob Storage
account_url = "https://<your_account_name>.blob.core.windows.net"
container_name = "transactions"
credential = DefaultAzureCredential()
blob_service_client = BlobServiceClient(account_url, credential=credential)

# Load transaction data (download the blob bytes and read them as CSV)
blob_client = blob_service_client.get_blob_client(container=container_name, blob="transactions.csv")
data = pd.read_csv(io.BytesIO(blob_client.download_blob().readall()))

# Train isolation forest for fraud detection (assumes these columns are numeric)
clf = IsolationForest(random_state=42)
data['fraud_score'] = clf.fit_predict(data[['amount', 'transaction_time', 'merchant_id']])

# Flag potential fraud cases (Isolation Forest marks outliers with -1)
suspicious = data[data['fraud_score'] == -1]
print("Potential fraud cases detected:", suspicious.shape[0])

This example demonstrates how risk analysts can use cloud-hosted transaction data (in Azure Blob Storage) with Python machine learning algorithms to detect suspicious behavior.

Free Cloud Resources for Practicing SQL and Python

Here are some free platforms where you can practice SQL and Python in a cloud-like environment:

  1. Google BigQuery (Sandbox)
    • Free tier: 10 GB storage & 1 TB queries per month.
    • Practice SQL directly in the browser.
    • BigQuery Sandbox (https://cloud.google.com/bigquery/docs/sandbox)
  2. Snowflake Free Trial
    • $400 free credits and free database access for learning SQL.
    • Snowflake Free Trial (https://signup.snowflake.com/ )
  3. Microsoft Azure for Students
    • Free $100 credits (no credit card required) for cloud SQL and Python projects.
    • Azure for Students( https://azure.microsoft.com/en-us/free/students/)
  4. Google Colab (Python Notebooks)
    • 100% free Python execution environment with GPU/TPU support.
    • Excellent for practicing machine learning and risk analytics.
    • Google Colab (  https://colab.research.google.com/)
  5. Kaggle Notebooks
    • Free Python environment with preloaded datasets.
    • Useful for practicing risk modeling without setup.
    • Kaggle Notebooks(  https://www.kaggle.com/code )
  6. Mode Analytics SQL Tutorial
    • Interactive browser-based SQL editor with free sample datasets.
    • Mode SQL Tutorial  (https://mode.com/sql-tutorial/)

Future Directions

As organizations increasingly adopt multi-cloud and hybrid-cloud strategies, the role of the cloud in analytics will only deepen. Emerging trends include:

  • Serverless analytics: Running queries or models without provisioning infrastructure.
  • Automated machine learning (AutoML): Reducing the barrier to entry for predictive modeling.
  • Integration with generative AI: Enhancing analytics workflows with advanced natural language interfaces.
  • Edge-cloud synergy: Bringing analytics closer to data sources in IoT environments.

These trends suggest a future where analysts and scientists focus less on infrastructure and more on extracting actionable insights.


Conclusion

Cloud computing has revolutionized the practice of data analysis and data science. For data analysts, cloud-hosted SQL engines, ETL services, and visualization tools enable efficient analysis of large datasets. For data scientists, managed machine learning environments and scalable compute resources accelerate experimentation and model deployment. Through examples in SQL, Python, and SAS, this paper illustrates the practical applications of cloud technologies in modern analytics workflows.

As data volumes continue to grow, cloud platforms will remain indispensable for organizations seeking to unlock the full potential of their data. The combination of scalability, flexibility, and advanced tools makes the cloud not just an option but a necessity for the future of analytics.