Salary Prediction: Part 2

Date: 11 Oct 2024

In the first part of this series, we introduced how Software Brio helped a US-based client predict salaries based on various factors, including industry, location, experience, and gender.

We broke down the solution into three major steps: data collection and storage, model generation, and building a mobile tool.

In this blog, we'll explore the technical details of each step, offering an inside look at how we crafted, implemented, and refined every component.

HR teams use the AI salary prediction tool to help determine what salary to propose.

Step 1: Data Collection and Storage

One of the initial tasks was to ensure the client could easily upload salary data, whether in small batches or large volumes. This required a system that was user-friendly for manual uploads, yet robust enough to handle bulk data processing. Here’s how we accomplished this:

Frontend: User-Friendly Upload System

We built a ReactJS frontend that served as the interface for the client’s HR team to upload data. The tool had two primary modes:

  • Manual Entry: For smaller datasets, users could manually input data, such as industry, location, years of experience, and gender, into a form.
  • Bulk Uploads: For larger datasets, we allowed bulk uploading of CSV files, ensuring the client could easily handle hundreds of thousands of records at once.

In both cases, the data was sent through an API to a Flask backend, where it underwent preliminary validation.
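As a minimal sketch, the two upload paths could look like this on the Flask side. The route names, field names, and response shapes here are assumptions for illustration, not the client’s exact API:

```python
import io

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Assumed column set for the salary dataset.
REQUIRED_COLUMNS = {"industry", "location", "years_experience", "gender", "salary"}

@app.route("/api/salaries/manual", methods=["POST"])
def manual_entry():
    """Accept a single record submitted from the React form as JSON."""
    record = request.get_json(force=True)
    missing = REQUIRED_COLUMNS - record.keys()
    if missing:
        return jsonify({"error": f"missing fields: {sorted(missing)}"}), 400
    # ... hand the record on to validation and storage ...
    return jsonify({"status": "accepted"}), 201

@app.route("/api/salaries/bulk", methods=["POST"])
def bulk_upload():
    """Accept an uploaded CSV file and parse it into a DataFrame."""
    file = request.files["file"]
    df = pd.read_csv(io.BytesIO(file.read()))
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        return jsonify({"error": f"missing columns: {sorted(missing)}"}), 400
    # ... hand the DataFrame on to validation and storage ...
    return jsonify({"status": "accepted", "rows": len(df)}), 201
```

Keeping both paths behind the same API meant manual entries and bulk files flowed through identical validation downstream.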

Backend: Data Validation and Storage

Once the data was received on the backend, we performed data validation using Pandas, ensuring that it was properly formatted, cleaned, and checked for missing or inconsistent values (a sketch of these checks follows the list below).

  • Validation Checks: We ensured that each entry was complete and followed expected patterns (e.g., valid ranges for years of experience, matching gender categories).
  • Error Handling: For bulk uploads, we provided detailed feedback, allowing users to correct errors and re-upload the file.
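Below is a simplified sketch of those checks in Pandas. The gender category set, the valid experience range, and the required columns are assumptions for illustration:

```python
import pandas as pd

VALID_GENDERS = {"male", "female", "other"}  # assumed category set

def validate(df: pd.DataFrame) -> tuple[pd.DataFrame, list[str]]:
    """Return the cleaned frame plus per-row error messages for re-upload feedback."""
    df = df.copy()
    df["gender"] = df["gender"].str.strip().str.lower()

    # Flag rows that violate the expected patterns.
    bad_gender = ~df["gender"].isin(VALID_GENDERS)
    bad_experience = ~df["years_experience"].between(0, 60)  # assumed valid range
    incomplete = df[["industry", "location", "salary"]].isna().any(axis=1)

    failed = bad_gender | bad_experience | incomplete
    errors = [f"row {idx}: failed validation checks" for idx in df.index[failed]]
    return df[~failed], errors
```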

The validated data was then stored in Amazon S3. This cloud-based storage system was ideal for managing large volumes of data with high availability and scalability. Each dataset was organized by month and indexed by industry, location, and other features for easy retrieval.
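A minimal sketch of that storage step with boto3, assuming a bucket name and the month-based key layout described above:

```python
import io
from datetime import date

import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "client-salary-data"  # assumed bucket name

def store_dataset(df: pd.DataFrame) -> str:
    """Write a validated dataset to S3 under a month-based key."""
    key = f"salaries/{date.today():%Y/%m}/records.csv"
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    s3.put_object(Bucket=BUCKET, Key=key, Body=buffer.getvalue())
    return key
```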

Technologies Used:

  • ReactJS: Frontend for the client to input or upload data.
  • Flask API: Backend to receive, validate, and process data.
  • Pandas: For validating and cleaning data.
  • Amazon S3: Storage solution for datasets.
Software Brio’s API development: React for the frontend, Flask for the backend, Pandas for data validation, and S3 for data storage.

Step 2: Data Processing and Model Generation

Once the data was securely stored, we moved on to generating the salary prediction model. This process required efficient data processing and model training to ensure timely and accurate predictions.

Spark Jobs for Large-Scale Data Processing

Given the size of the datasets, which included hundreds of thousands of records across different industries and regions, we needed a system capable of handling large-scale data. For this, we used Apache Spark. Spark is designed for distributed computing, which means it can process huge amounts of data efficiently across multiple nodes.

Each month, Spark retrieved the latest dataset from S3 and began the data processing pipeline:

  • Data Cleaning and Transformation: Spark handled any final transformations required to prepare the data for model training. For example, categorical variables like gender were converted into numerical values.
  • Feature Engineering: Additional features were derived, such as interaction terms between industry and experience, which could help the model better capture relationships in the data.
  • Model Training: With Spark handling the heavy preprocessing, we trained the salary prediction model in Scikit-learn. For simplicity and accuracy, we used Multiple Linear Regression initially, though the system was flexible enough to experiment with more complex models like Random Forest or XGBoost. A condensed sketch of this pipeline follows the list.
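Here is that monthly pipeline in condensed form: Spark does the distributed cleaning and encoding, and the prepared frame is collected so Scikit-learn can fit the regression. The S3 path, column names, and exact feature set are assumptions for illustration:

```python
import joblib
import pandas as pd
from pyspark.sql import SparkSession, functions as F
from sklearn.linear_model import LinearRegression

spark = SparkSession.builder.appName("salary-model").getOrCreate()

# Read the month's validated dataset from S3 (path layout assumed).
df = spark.read.csv(
    "s3a://client-salary-data/salaries/2024/10/", header=True, inferSchema=True
)

# Cleaning and transformation: drop incomplete rows, encode gender numerically.
df = df.dropna().withColumn(
    "gender_code", F.when(F.col("gender") == "female", 1.0).otherwise(0.0)
)

# Collect the prepared data so Scikit-learn can fit on the driver.
pdf = df.toPandas()
features = pd.get_dummies(
    pdf[["years_experience", "gender_code", "industry", "location"]],
    columns=["industry", "location"],
)
# Feature engineering: interaction terms between experience and each industry.
for col in [c for c in features.columns if c.startswith("industry_")]:
    features[f"{col}_x_exp"] = features[col] * features["years_experience"]

model = LinearRegression().fit(features, pdf["salary"])
joblib.dump(model, "salary_model_2024_10.joblib")  # versioned artifact name
```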

Hosting the Predictive Model via API

After training the model, the next task was to ensure that it could be accessed by various client tools and applications. We hosted the trained model on a Flask-based API that acted as the interface between the model and the client’s Android app.

  • API Security: The API was secured using OAuth 2.0 to ensure that only authorized users could access the model predictions.
  • Model Versioning: Each month, a new model was trained, and version control was maintained to track improvements and ensure transparency (a sketch of the serving endpoint follows this list).
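A simplified sketch of that serving endpoint, with a bare-bones bearer-token check standing in for the full OAuth 2.0 flow and a filename convention standing in for the versioning scheme:

```python
import glob

import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the most recent monthly artifact; the filename carries the model version.
latest = sorted(glob.glob("salary_model_*.joblib"))[-1]
model = joblib.load(latest)

@app.route("/api/predict", methods=["POST"])
def predict():
    # Placeholder check; the real API validated OAuth 2.0 access tokens.
    if not request.headers.get("Authorization", "").startswith("Bearer "):
        return jsonify({"error": "unauthorized"}), 401
    payload = request.get_json(force=True)
    # The real service re-applies the training-time feature encoding here so
    # that the columns match what the model was fitted on.
    features = pd.DataFrame([payload])
    prediction = float(model.predict(features)[0])
    return jsonify({"predicted_salary": prediction, "model_version": latest})
```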

Technologies Used:

  • Apache Spark: For large-scale data processing and feature engineering.
  • Scikit-learn: For model training.
  • Flask API: To serve the trained model as a RESTful service.
Software Brio’s machine learning pipeline, utilizing Spark and scikit-learn, with model hosting powered by Flask.

Step 3: Building the Mobile Application

Finally, the client needed a user-friendly mobile app to allow HR teams to input features like years of experience, industry, and gender, and receive salary predictions in real-time.

Android Application

We developed an Android app in Java with a simple, intuitive interface. Users could input the required information and send it to the API. The app was designed to be clean and efficient, ensuring easy navigation through the input fields.

  • Input Form: Users filled out a form with inputs such as location, industry, years of experience, and gender.
  • Real-Time Salary Predictions: Once the form was submitted, the app made an API call to the Flask-based model API. Within seconds, the user received a salary prediction based on the latest model (the request/response contract is sketched after this list).
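For clarity, here is that contract shown in Python rather than Java; the Android app issues the equivalent POST via Retrofit. The host, endpoint, field names, and example values are assumptions for illustration:

```python
import requests

response = requests.post(
    "https://api.example.com/api/predict",  # placeholder host
    headers={"Authorization": "Bearer <access-token>"},
    json={
        "industry": "Finance",
        "location": "New York",
        "years_experience": 7,
        "gender": "female",
    },
    timeout=10,
)
# e.g. {"predicted_salary": 112500.0, "model_version": "salary_model_2024_10.joblib"}
print(response.json())
```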

Performance Optimization and Data Security

Given the importance of speed and security in a mobile app:

  • Async Calls: We used asynchronous API calls to ensure that the app remained responsive even when dealing with large datasets.
  • Data Encryption: All data sent between the app and the API was encrypted to ensure privacy and confidentiality.

Technologies Used:

  • Java: For building the Android mobile app.
  • Retrofit: For API communication between the app and backend.
  • Flask API: For processing prediction requests.
The Android development process and our optimization techniques for networking and security.

Conclusion

In Part 2, we delved into the technical side of how Software Brio helped the client build a salary prediction tool. By leveraging a combination of cloud storage (S3), large-scale processing (Spark), and mobile development (Android), we were able to create a seamless, scalable solution.

In Part 3, we’ll discuss the challenges we encountered during development and how we addressed them, along with feedback from the client’s end-users on how the solution transformed their salary prediction process. Stay tuned!

Software Brio is a software consultancy company in India that develops custom AI solutions for clients. Follow us for more case studies like this. Explore our portfolio here: Portfolio Link
