A machine learning project designed to combat online harassment by filtering toxic content on social media platforms. Emakia addresses the urgent need to protect marginalized groups, including women, people of color, and LGBTQI+ individuals, from harmful online interactions.
Emakia is building an application to filter toxic content on social media. The project focuses on developing an efficient text classifier and an LLM-based system that validates training labels and model outputs. The system includes:
- Text Classifier Training: Machine learning models for toxicity detection
- Label Validation: LLM-based validation system using LangChain and OpenAI
- Apple Development: iOS application with CoreML integration
- Google Cloud Integration: Server-side processing and model deployment
- Data Processing: Tools for data scraping, labeling, and preparation
emakia-system/
├── apple_development/                   # iOS application development
│   ├── Enaelle/                         # Main iOS app
│   └── Models/                          # CoreML models and playground files
│
├── backend-Google-iPhone/               # Backend services for iOS integration
│   └── app.py                           # Main Flask/FastAPI application
│
├── google_cloud_server_development/
│   ├── text_classifier_training/        # Model training pipelines
│   │   ├── notebook/                    # Jupyter notebooks for EDA and POC
│   │   ├── emakia-node/                 # Node.js training scripts
│   │   └── validation/                  # Model evaluation scripts
│   └── gcloud-toolkit-recent-search/    # Twitter API integration
│
├── LLM-RAG-Toxicity-Evaluator/          # Label validation using LLMs
│   ├── lang-chain-openai/               # OpenAI-based validation
│   ├── Gemini/                          # Google Gemini integration
│   ├── Grok/                            # Grok API integration
│   └── Fireworks/                       # Fireworks AI integration
│
├── LLM_FN_Analysis/                     # False Negative analysis
├── LLM_FP-Analysis/                     # False Positive analysis
├── load-Neo4j/                          # Graph database integration
├── createnewlabels/                     # Label generation and processing
├── Scrapping/                           # Data scraping tools
└── data/                                # Data storage (not in git)
    ├── raw/                             # Original, immutable data
    ├── interim/                         # Intermediate transformed data
    ├── processed/                       # Final datasets for modeling
    └── external/                        # Third-party data sources
Training pipelines for toxicity detection models:
- Jupyter notebooks for exploratory data analysis (EDA) and proof of concept
- Node.js scripts for data processing
- Validation scripts for model evaluation: `evaluate_model_01.py`, `evaluate_model_02.py` (see the metrics sketch below)
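As a rough illustration of what the evaluation step computes, here is a minimal sketch using scikit-learn. The list-based inputs and the 0/1 label encoding are assumptions for illustration; the actual scripts may compute additional metrics.

```python
# Minimal sketch of a model-evaluation step, in the spirit of
# evaluate_model_01.py / evaluate_model_02.py (the actual scripts may differ).
# Assumes binary labels: 1 = harassment, 0 = neutral.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical model predictions

# labels=[0, 1] fixes the ordering so ravel() returns tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
total = tn + fp + fn + tp

print(f"TP {tp / total:.2%}  TN {tn / total:.2%}  FN {fn / total:.2%}  FP {fp / total:.2%}")
print(f"Accuracy: {accuracy_score(y_true, y_pred):.2%}")
```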
The system uses multiple LLM providers to validate training labels:
- LangChain-OpenAI: Primary validation system
- Gemini: Google's model integration
- Grok: X.AI integration
- Fireworks: Additional validation provider
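The following is a minimal sketch of how several of these providers could be instantiated side by side through their LangChain integrations. The model names, parameters, and the comparison loop are illustrative assumptions, not the repository's actual configuration.

```python
# Sketch of instantiating several validation providers through their
# LangChain integrations. Model names and parameters are illustrative
# assumptions; each client expects its own API key in the environment.
from langchain_openai import ChatOpenAI                    # OPENAI_API_KEY
from langchain_google_genai import ChatGoogleGenerativeAI  # GOOGLE_API_KEY
from langchain_fireworks import ChatFireworks              # FIREWORKS_API_KEY

providers = {
    "openai": ChatOpenAI(model="gpt-4o-mini", temperature=0),
    "gemini": ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0),
    "fireworks": ChatFireworks(model="accounts/fireworks/models/llama-v3p1-8b-instruct"),
    # Grok could be wired in similarly, e.g. via the langchain-xai package.
}

statement = "Example tweet text to validate."
for name, llm in providers.items():
    reply = llm.invoke(f"Is the following statement harassment or neutral?\n{statement}")
    print(name, reply.content)
```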
Evaluation findings from LLM-based validation:
- True Positive (TP): 31.50% - Harassment correctly identified
- True Negative (TN): 41.68% - Neutral content correctly identified
- False Negative (FN): 24.68% - Harassment missed (mislabeled as neutral)
- False Positive (FP): 2.12% - Neutral content flagged as harassment
Overall Accuracy: 73.18% (TP + TN)
Error Rate: 26.82% (FN + FP)
Note: 2% of training labels contained only return characters and need to be cleaned.
iOS application components:
- SwiftUI/Cocoa application
- CoreML model integration (`emakiaTweetsSentimentClassifier.mlmodel`)
- Data persistence layer
Google Cloud components:
- Vertex AI model training and deployment
- BigQuery data processing
- Twitter API toolkit integration for data collection
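For the BigQuery side, processing typically runs as a query job through the Python client. A minimal, hedged sketch follows; the project ID, dataset, and table names are hypothetical placeholders, not the project's actual schema.

```python
# Minimal BigQuery sketch. The project ID, dataset, and table names are
# hypothetical placeholders, not the repository's actual schema.
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")  # uses gcloud credentials

query = """
    SELECT text, label
    FROM `your-gcp-project.emakia_dataset.labeled_tweets`
    WHERE label IS NOT NULL
    LIMIT 100
"""

for row in client.query(query).result():
    print(row.text, row.label)
```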
Prerequisites:
- Python 3.x
- Node.js (for backend services)
- Xcode (for iOS development)
- Google Cloud SDK (for cloud services)
- OpenAI API key (for LLM validation)
Setup:
- Clone the repository:
  - `git clone https://github.com/Emakia-Project/emakia.git`
  - `cd emakia-system`
- Python Virtual Environment:
  - `python3 -m venv env`
  - `source env/bin/activate` (on Windows: `env\Scripts\activate`)
  - `pip install -r requirements.txt`
- LangChain-OpenAI Setup:
  - `pip install langchain_openai`
  - `export LANGCHAIN_PROJECT=default`
- Environment Variables: Create a `.env` file in the `LLM-RAG-Toxicity-Evaluator/lang-chain-openai/` directory:
  - `OPENAI_API_KEY=your_openai_api_key`
- Node.js Backend (if using):
  - `cd google_cloud_server_development/text_classifier_training/emakia-node`
  - `npm install`
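Once the `.env` file exists, a validation script can read the key at runtime. Below is a minimal sketch assuming the `python-dotenv` package; whether the repository's scripts actually use it is an assumption.

```python
# Load OPENAI_API_KEY from the .env file described above.
# Assumes the python-dotenv package (pip install python-dotenv).
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; check your .env file.")
```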
The `data/` directory follows a standard ML project structure:
- `raw/`: Original, immutable data dumps
- `interim/`: Intermediate data that has been transformed
- `processed/`: Final, canonical datasets for modeling
- `external/`: Data from third-party sources
Important: The `data/` folder and `.env` files are excluded from git via `.gitignore`. These must be set up locally. To include them in version control, modify the `.gitignore` file accordingly.
The system uses two main approaches for label validation:
- Direct Classification:
  - Prompt: "You are a sentiment analyst. Analyze the following statement and respond with either 'positive' or 'negative'."
- Explanation-Based Analysis:
  - Prompt: "Analyze the following statement and explain your choice of overall sentiment."
  - Focuses on analyzing false negatives and false positives for model improvement
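A hedged sketch of how these two approaches could be wired up with LangChain is shown below; the prompt wording is taken from above, but the model choice and chain structure are assumptions rather than the repository's exact code.

```python
# Sketch of the two validation approaches using the prompts quoted above.
# The model choice and chain wiring are assumptions.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# 1. Direct classification: a single-word sentiment label.
direct = ChatPromptTemplate.from_template(
    "You are a sentiment analyst. Analyze the following statement and "
    "respond with either 'positive' or 'negative'.\n\nStatement: {statement}"
) | llm

# 2. Explanation-based analysis: used when inspecting false negatives and
#    false positives.
explain = ChatPromptTemplate.from_template(
    "Analyze the following statement and explain your choice of overall "
    "sentiment.\n\nStatement: {statement}"
) | llm

statement = "Nobody asked for your opinion, go away."
print(direct.invoke({"statement": statement}).content)
print(explain.invoke({"statement": statement}).content)
```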
Key scripts:
- `evaluateLabelswithLangchain.py`: Main label validation script
- `evaluate_model_01.py`, `evaluate_model_02.py`: Model evaluation with various metrics
- `evaluate-with-lexicon.py`: Lexicon-based evaluation
- `generateCOREMLtraining_file.py`: CoreML training data preparation
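As an illustration of the CoreML training-data step, here is a hedged sketch of what a `generateCOREMLtraining_file.py`-style export could look like. The paths, column names, and label mapping are assumptions; Create ML text classifiers can ingest tabular files with a text column and a label column.

```python
# Sketch of preparing a Create ML-compatible training file for the CoreML
# text classifier, in the spirit of generateCOREMLtraining_file.py.
# Paths, column names, and the label mapping are assumptions.
import csv

LABELS = {0: "neutral", 1: "harassment"}

def write_coreml_training_file(rows, out_path="coreml_training.csv"):
    """rows: iterable of (tweet_text, numeric_label) pairs."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])  # columns a Create ML text classifier can ingest
        for text, label in rows:
            cleaned = text.replace("\r", " ").replace("\n", " ").strip()
            if cleaned:  # skip entries that contain only return characters
                writer.writerow([cleaned, LABELS[int(label)]])

write_coreml_training_file([("You are awful", 1), ("Nice photo!", 0)])
```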
See LICENSE file for details.
This is a private project. Please contact the maintainers for contribution guidelines.
Note: This project is under active development. Documentation and code are subject to change.