- Ellie Kloberdanz

Data analysis and modeling on medical data can provide extremely useful insights into both public and individual health. However, there are two primary challenges when running statistical analyses or developing predictive models with medical data. The first is the size of medical data sets: medical trials often include too few participants for building complex machine learning models. The second is that medical data are governed by privacy rules and laws such as HIPAA.

One approach to the privacy challenge is differential privacy, which adds noise to obscure data points related to specific individuals. The downside is that this allows for studying only aggregated data at a high level. Moreover, the noise added to ensure privacy may push the data too far from the original values. There is therefore a trade-off between privacy protection and data utility, controlled by the privacy budget ϵ: a smaller ϵ means more noise and stronger privacy, but less accurate results. The figure below demonstrates this trade-off with an example of an employee database query that returns the total number of employees in two different months. As more noise is added, the reported employee count becomes less accurate, which may help hide private information such as the termination of a specific individual, at the expense of accurately reporting the total headcount.

**Figure 1: Differential privacy techniques have a trade-off between privacy protection and data utility (Chang et al., 2021)**
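
To make the trade-off concrete, here is a minimal sketch of the Laplace mechanism, one common way to release a count under differential privacy. It is not necessarily the mechanism behind the figure above; the function names and the headcount of 100 are hypothetical and chosen only for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise.

    A counting query has sensitivity 1, so the noise scale is 1 / epsilon:
    a smaller epsilon means more noise and stronger privacy.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(true_count: int, epsilon: float,
                   trials: int = 2000, seed: int = 0) -> float:
    """Average absolute error of the noisy count over repeated releases."""
    rng = random.Random(seed)
    errors = [abs(noisy_count(true_count, epsilon, rng) - true_count)
              for _ in range(trials)]
    return sum(errors) / trials


if __name__ == "__main__":
    headcount = 100  # hypothetical true number of employees
    for eps in (0.05, 0.5, 5.0):
        err = mean_abs_error(headcount, eps)
        print(f"epsilon={eps}: mean absolute error ~ {err:.2f}")
```

The expected absolute error of this mechanism equals the noise scale 1/ϵ, so tightening the privacy budget by a factor of ten makes the reported headcount roughly ten times less accurate.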

This is where Cape's confidential computing platform based on AWS Nitro enclaves comes in.

Cape's confidential computing platform lets users process data in a privacy-preserving manner without compromising between data privacy and utility. With Cape, you don't have to use differential privacy methods. Instead, you can process your original data as is, because it is encrypted and processed in a secure enclave in the cloud.
Cape provides a CLI that lets users encrypt their input data and deploy and run serverless functions with simple commands: *cape encrypt*, *cape deploy*, and *cape run*. Cape also provides two SDKs, pycape and cape-js, for using Cape from Python and JavaScript programs, respectively.

In this blog we will use a publicly available breast cancer dataset, which contains tabular data with several attributes describing the breast tumor (e.g., its size and shape) along with a classification of the tumor as malignant or benign. For example, a tumor that is uniform and round is typically noncancerous. While this dataset is publicly available, most medical data is not, so we use it as an example to demonstrate how Cape can be leveraged for private medical data processing.

Since the model we wish to develop is a binary classifier that identifies breast tumors as malignant or benign, and the number of data points is not very large, a logistic regression model is suitable. Logistic regression is a classification model that uses input attributes to predict a categorical variable, e.g., yes or no. In this demonstration we focus on binary classification since there are only two possible outcomes.
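
For reference, the standard textbook formulation of binary logistic regression is given below. The notation is an assumption for illustration and does not come from the original post: $x_i$ is the feature vector of sample $i$, $y_i \in \{0, 1\}$ its label, $w$ the weights, $b$ the bias, and $n$ the number of training samples.

```latex
\begin{aligned}
\hat{y}_i &= \sigma(w^\top x_i + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \\
\mathcal{L}(w, b) &= -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right] \\
\frac{\partial \mathcal{L}}{\partial w} &= \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)\, x_i, \qquad
\frac{\partial \mathcal{L}}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)
\end{aligned}
```

The model predicts class 1 when $\hat{y}_i > 0.5$, and gradient descent repeatedly moves $w$ and $b$ against these gradients.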

Any function deployed with Cape must live in a file named *app.py*, which must contain a function called *cape_handler()* that takes the function's input and returns its result. In this case the input is the breast cancer dataset that serves as training data and the output is the trained logistic regression model.
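
Before looking at the full training code, a minimal sketch of this contract may help. The toy handler below only counts the rows of a CSV payload. It assumes the input arrives as bytes (consistent with the `decode` call in the handler later in this post) and returns a plain string, which is an illustrative choice rather than a documented requirement.

```python
import csv
import io


def cape_handler(input_data: bytes) -> str:
    """Toy handler: report how many data rows a CSV payload contains."""
    text = input_data.decode("utf-8")
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return "0 rows, 0 columns"
    header, body = rows[0], rows[1:]
    return f"{len(body)} rows, {len(header)} columns"


if __name__ == "__main__":
    # Hypothetical two-row sample, only for a local smoke test
    sample = b"id,radius,target\n1,14.2,0\n2,20.1,1\n"
    print(cape_handler(sample))
```

Calling the handler locally like this is a quick way to check the input parsing before deploying.
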
The code snippet below shows our *app.py*. First, we import some libraries and define a logistic regression class with methods that can perform training or compute model accuracy and loss.

*Import libraries*

```
import pandas as pd
import numpy as np
import copy


# Define a logistic regression class
class LogisticRegression():
    def __init__(self):
        # Per-epoch training history
        self.losses = []
        self.train_accuracies = []

    def accuracy_score(self, y_true, y_pred):
        correct = np.sum(y_true == y_pred)
        accuracy = correct / y_true.shape[0]
        return accuracy

    def fit(self, x, y, epochs):
        x = self._transform_x(x)
        y = self._transform_y(y)
        self.weights = np.zeros(x.shape[1])
        self.bias = 0
        for i in range(epochs):
            x_dot_weights = np.matmul(self.weights, x.transpose()) + self.bias
            pred = self._sigmoid(x_dot_weights)
            loss = self.compute_loss(y, pred)
            error_w, error_b = self.compute_gradients(x, y, pred)
            self.update_model_parameters(error_w, error_b)
            pred_to_class = [1 if p > 0.5 else 0 for p in pred]
            self.train_accuracies.append(self.accuracy_score(y, pred_to_class))
            self.losses.append(loss)

    def compute_loss(self, y_true, y_pred):
        # binary cross entropy (1e-9 guards against log(0))
        y_zero_loss = y_true * np.log(y_pred + 1e-9)
        y_one_loss = (1 - y_true) * np.log(1 - y_pred + 1e-9)
        return -np.mean(y_zero_loss + y_one_loss)

    def compute_gradients(self, x, y_true, y_pred):
        # derivative of binary cross entropy
        difference = y_pred - y_true
        gradient_b = np.mean(difference)
        gradients_w = np.matmul(x.transpose(), difference)
        gradients_w = np.array([np.mean(grad) for grad in gradients_w])
        return gradients_w, gradient_b

    def update_model_parameters(self, error_w, error_b):
        # Gradient descent step with a fixed learning rate of 0.1
        self.weights = self.weights - 0.1 * error_w
        self.bias = self.bias - 0.1 * error_b

    def predict(self, x):
        x_dot_weights = np.matmul(x, self.weights.transpose()) + self.bias
        probabilities = self._sigmoid(x_dot_weights)
        return [1 if p > 0.5 else 0 for p in probabilities]

    def _sigmoid(self, x):
        return np.array([self._sigmoid_function(value) for value in x])

    def _sigmoid_function(self, x):
        # Numerically stable sigmoid: only exponentiates non-positive values
        if x >= 0:
            z = np.exp(-x)
            return 1 / (1 + z)
        else:
            z = np.exp(x)
            return z / (1 + z)

    def _transform_x(self, x):
        # Convert the pandas DataFrame to a NumPy array
        x = copy.deepcopy(x)
        return x.values

    def _transform_y(self, y):
        # Convert the pandas Series to a column vector of shape (n, 1)
        y = copy.deepcopy(y)
        return y.values.reshape(y.shape[0], 1)
```
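
For comparison, the same kind of training step can be written in vectorized form. The sketch below is an illustrative alternative, not the code deployed in this post: it keeps the labels one-dimensional so that `pred - y` is an element-wise difference, and it averages the gradient over the samples. Because of these differences it will not reproduce the exact weights reported later. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    """Numerically stable logistic function applied element-wise."""
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out


def gradient_step(x, y, weights, bias, lr=0.1):
    """One gradient descent step on mean binary cross entropy.

    x: (n, d) features, y: (n,) labels in {0, 1}, weights: (d,), bias: scalar.
    """
    pred = sigmoid(x @ weights + bias)   # shape (n,)
    diff = pred - y                      # shape (n,), element-wise
    grad_w = x.T @ diff / x.shape[0]     # shape (d,), averaged over samples
    grad_b = diff.mean()
    return weights - lr * grad_w, bias - lr * grad_b


if __name__ == "__main__":
    # Tiny synthetic, linearly separable data (hypothetical)
    x = np.array([[-2.0], [-1.0], [1.0], [2.0]])
    y = np.array([0.0, 0.0, 1.0, 1.0])
    w, b = np.zeros(1), 0.0
    for _ in range(200):
        w, b = gradient_step(x, y, w, b)
    print((sigmoid(x @ w + b) > 0.5).astype(int))
```

Keeping every array one-dimensional makes the shapes easy to check at each step, which helps when adapting the training loop to other data.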

In addition to the logistic regression class, our *app.py* also contains the required *cape_handler* function, which takes the training data as input, splits it into a train and test set, instantiates the above defined logistic regression class, performs training, and outputs the trained model along with its accuracy.

```
# Cape handler
def cape_handler(input_data):
    # Decode the input bytes and turn literal "\t" and "\n" sequences
    # into commas and newlines so the payload parses as CSV
    csv = input_data.decode("utf-8")
    csv = csv.replace('\\t', ',').replace('\\n', '\n')
    with open('data.csv', 'w') as f:
        f.write(csv)
    data = pd.read_csv('data.csv')

    # Randomly hold out 33% of the rows as a test set
    data_size = data.shape[0]
    test_split = 0.33
    test_size = int(data_size * test_split)
    choices = np.arange(0, data_size)
    test = np.random.choice(choices, test_size, replace=False)
    train = np.delete(choices, test)
    test_set = data.iloc[test]
    train_set = data.iloc[train]

    # Use every column except the first and the last as features
    column_names = list(data.columns.values)
    features = column_names[1:-1]
    y_train = train_set["target"]
    y_test = test_set["target"]
    X_train = train_set[features]
    X_test = test_set[features]

    # Train the model and evaluate it on the held-out rows
    lr = LogisticRegression()
    lr.fit(X_train, y_train, epochs=150)
    pred = lr.predict(X_test)
    accuracy = lr.accuracy_score(y_test, pred)

    # trained model
    model = {"accuracy": accuracy, "weights": lr.weights.tolist(), "bias": lr.bias.tolist()}
    return model
```

To deploy our function with Cape, we first need to create a folder that contains all needed dependencies. For this logistic regression training app, that deployment folder needs to contain the *app.py* above. Additionally, because the *app.py* program imports some external libraries (in this case, numpy and pandas), the deployment folder needs to contain those as well. We can save a list of those dependencies into a *requirements.txt* file and run Docker to install them into our deployment folder called *app* as follows:

`sudo docker run -v "$(pwd)":/build -w /build --rm -it python:3.9-slim-bullseye pip install -r requirements.txt --target ./app/`

Now that we have everything ready, we can log into Cape:

```
cape login
Your CLI confirmation code is: GZPN-KHMT
Visit this URL to complete the login process: https://login.capeprivacy.com/activate?user_code=GZPN-KHMT
Congratulations, you're all set!
```

And after that we can deploy the app:

```
cape deploy ./app
Deploying function to Cape ...
Success! Deployed function to Cape
Function Checksum ➜ 348ea2008f014b4d62562b4256bf2ddbbebcbd8b958981de5c2e01a973f690f8
Function Id ➜ 5wggR4ZaEBdfHQSbV2RcN5
```

Now that the app is deployed, we can pass it an input and invoke it with *cape run*:

```
cape run 5wggR4ZaEBdfHQSbV2RcN5 -f breast_cancer_data.csv
{'accuracy': 0.9197860962566845, 'weights': [10256.691270418847, 19071.613672774896, 63157.95554188486, 97842.31573298419, 106.154850842932, 43.29810217015701, -44.1862890971466, -22.519840356544492, 198.12010662303672, 78.6238754895288, 48.39822623036688, 1508.6634081937177, 342.695612801048, -22814.6600120419, 8.905474463874354, 16.958969184554977, 18.625567417774857, 7.857666827748692, 25.00139435235602, 4.305377619109947, 9667.094831413606, 24077.953801047104, 59698.82218324606, -91019.69570680606, 137.85512994764406, 64.23315269371734, -35.801829085602265, 1.0606119748691598, 287.2889897905756, 89.52499975392664], 'bias': 3.247905759162303}
```

The output above lists the parameters of the trained model, i.e., its weights and bias, which define the model and can be used to perform inference. We can also see that the trained model's accuracy on the test data is about 92%.
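
As a rough sketch of what inference with these parameters could look like, the helper below applies the logistic function to a weighted sum of the features. The `predict` helper and the two-feature toy model are hypothetical. With the real model, the feature vector would need the same 30 attributes, in the same order and on the same scale as the training data.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(model: dict, features: list) -> int:
    """Return class 1 if the predicted probability exceeds 0.5, else class 0."""
    if len(features) != len(model["weights"]):
        raise ValueError("feature count does not match the number of weights")
    z = sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]
    return 1 if sigmoid(z) > 0.5 else 0


if __name__ == "__main__":
    # Hypothetical two-feature model, only to show the call pattern
    toy_model = {"weights": [2.0, -1.0], "bias": 0.5}
    print(predict(toy_model, [1.0, 1.0]))  # z = 1.5
    print(predict(toy_model, [0.0, 3.0]))  # z = -2.5
```

The dictionary has the same `weights` and `bias` keys as the output returned by the deployed function, so that output could be passed to such a helper directly.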

In this blog we discussed the challenges of developing predictive models on medical data and how Cape's confidential computing platform can alleviate the privacy issues associated with medical data processing. We defined a logistic regression model and trained it to identify breast tumors as malignant or benign while keeping the medical data used for training confidential.