Dynamic Data Generation: Enhancing a Big Data Analytics Module with Junmou's AI Capabilities
Abstract 


This pedagogical case study from the School of Internet of Things at Xi'an Jiaotong-Liverpool University (XJTLU) details the application of XIPU AI (Junmou) in the IOT307TC Big Data Analytics module. It addresses a common challenge in big data education: providing students with diverse, realistic, hands-on datasets that contain specific, complex features. Junmou was deployed to generate highly customizable synthetic datasets for final-year students, moving beyond traditional static datasets. Student feedback and project outcomes indicate that this AI-driven approach significantly enhanced student engagement and students' ability to apply theoretical knowledge to practical, messy data. The case study provides a replicable framework for educators seeking to integrate AI tools for practical skill development in technology-intensive fields.


Keywords: AI-Driven Pedagogy, Synthetic Data, Big Data Analytics Education, Prompt Engineering, Problem-Based Learning.
 

1. Introduction
 
1.1 Course Background

In big data analytics education, a significant challenge is the lack of diverse, realistic datasets for students. Traditional, static datasets often do not contain the complexities encountered in professional settings, and real-world data from IoT devices are frequently restricted by proprietary or privacy regulations. This dilemma necessitates pedagogical innovation.

At XJTLU, the XIPU AI (Junmou) platform was utilized to address this issue. A study was conducted to document the integration of Junmou into the IOT307TC Big Data Analytics module, a compulsory course for 4th-year students. The central hypothesis was that a more dynamic and effective learning environment could be provided by creating bespoke, synthetic datasets with generative AI.

This approach is particularly relevant given the persistent theory-practice gap in data science, a field experiencing rapid expansion. Academic institutions face challenges such as the multidisciplinary nature of the field and the high cost of modern hardware [1]. Generative AI offers a solution by producing novel, synthetic data that mimics real-world properties, thereby overcoming barriers like data scarcity and privacy concerns [2]. This technology provides an inexhaustible supply of customizable information, enabling dynamic curricula and supporting the development of practical competencies [3].
 
1.2 Challenges in Bridging the Theory-Practice Gap in Data Science Education

Academic institutions face numerous challenges in bridging this gap. A primary difficulty is the inherent multidisciplinary nature of data science, which necessitates a curriculum that integrates knowledge from computer science, statistics, and mathematics, among other domains [4]. Compounding this issue is the high cost of providing modern hardware and software environments required for data science courses, as well as the significant challenge of retaining qualified educators who are often drawn to more lucrative opportunities in the private sector [5]. The dynamic and continuously evolving nature of data also requires ongoing upskilling, a burden academic programs struggle to meet. The core imperative for modern data science pedagogy is therefore to move beyond a focus on theoretical knowledge and to cultivate the practical, applied competencies that directly address these real-world challenges.

1.3 The Transformative Potential of Generative AI in Curriculum Design

A key innovation driving this framework is the application of generative artificial intelligence (AI) to create educational content. Generative AI, using models such as Generative Adversarial Networks (GANs) and large language models (LLMs), is revolutionizing the creation of training data by producing novel, synthetic datasets that mimic the statistical properties of real-world information [6].

This technology offers a powerful solution to several traditional barriers in data-driven education, including data scarcity, privacy concerns related to sensitive information, and the substantial cost and effort associated with collecting, preparing, and labeling large-scale, authentic datasets [7]. Synthetic data provides an inexhaustible, on-demand supply of customized information, accelerating innovation and enabling the rapid prototyping of new educational tools and methods without compromising the privacy of real individuals [8]. By leveraging generative AI, educators can move beyond the limitations of static, pre-packaged datasets and design curricula that are dynamic, authentic, and scalable.
 

2. Methodology: Enhancing the Classroom with an AI-Driven Framework

This case study is grounded in the principles of problem-based learning (PBL), a pedagogical approach where students learn by actively engaging with complex, real-world problems [9]. The rapid evolution of data analytics necessitates that students not only understand theoretical concepts but also possess the practical skills to handle real-world data challenges. However, conventional teaching methods often rely on a limited number of publicly available, pre-cleaned datasets that lack the specific complexities, such as missing values, biased samples, or diverse data types, that students will encounter in their professional lives. This artificial tidiness creates a false sense of security for students, leaving them unprepared for the unpredictable challenges of real-world data. 

By leveraging XIPU AI (Junmou) to generate bespoke, synthetic datasets, we created a dynamic and effective learning environment that aligns directly with the tenets of PBL. The AI-driven framework transformed the learning process from a passive exercise into an active, hands-on, and iterative one. As illustrated in the AI-Driven Learning Loop (Figure 1), the educator designs a problem, which Junmou translates into a "messy" dataset. Students then apply their data wrangling and cleaning skills, troubleshooting as they go, to arrive at a final solution. The students' work and refined solutions provide valuable feedback that informs the educator's next problem design, completing the loop and fostering a continuous cycle of learning and improvement.

The critical skill gap between theoretical data knowledge and the complex, messy reality of industrial data was directly addressed by this methodology. The learning process was made more engaging and personal by the problem-based nature of the AI-generated data, as students felt they were working on a real data science challenge rather than a classroom exercise. The datasets served as a crucial bridge between theoretical concepts and their practical implementation, solidifying students' understanding in a way that static, clean data could not. Ultimately, the complexities of industrial data projects were successfully simulated within a safe, guided educational environment, and students were empowered to critically engage with data's inherent complexities.

2.1 The Traditional Approach: Recognizing Pedagogical Limitations

Before the integration of Junmou, the module's practical sessions relied on traditional teaching datasets. These included widely used, publicly available datasets such as the Titanic survivor data or standardized sensor logs. While excellent for introducing foundational concepts like data loading, basic visualization, and simple statistical analysis, these datasets presented significant pedagogical limitations. They were typically pre-cleaned, well structured, and designed for straightforward analysis. This artificial tidiness created a false sense of security for students, who were not adequately prepared to handle the unpredictable challenges of real-world data, such as:

• Missing Data: Real-world datasets rarely have complete entries. Students needed to practice sophisticated imputation and handling techniques.

• Outliers and Anomalies: Data from IoT devices can be full of sensor errors, network glitches, or unusual events that manifest as outliers. Identifying and addressing these is a critical skill.

• Heterogeneous Data Types: A single IoT analytics project might involve structured numerical data, unstructured text logs, and time-series data. Students needed practice in integrating and harmonizing these diverse formats.

• Data Scarcity and Sensitivity: Many relevant real-world datasets are proprietary or contain sensitive information, making them inaccessible for classroom use.
 
These limitations highlighted a need for a more dynamic and customizable approach to data provision, one that could evolve with the specific learning objectives of each class session.

2.2 Integrating Junmou: A Step-by-Step Implementation

The implementation of Junmou followed a two-phase approach: a preparatory phase and an in-class execution phase. This methodology is summarized in Figure 1, which illustrates a continuous cycle of problem design, AI generation, and student-driven analysis that informs future problem creation.
 

Figure 1: The AI-Driven Learning Loop

This diagram illustrates a pedagogical framework where an AI, Junmou, is a central tool for creating a dynamic and iterative learning environment. The loop begins with an Educator designing a problem prompt. Junmou then Generates a "messy" dataset that presents a unique challenge for each student. Students then engage in Troubleshooting and Skill Application to clean and wrangle the data. Finally, the students' work provides valuable Feedback and New Insights that inform the educator's next problem design, completing the loop and fostering a continuous cycle of learning and improvement.
 

Phase 1: Preparatory Prompt Engineering.

A series of carefully constructed prompts for Junmou was drafted before each lab session. Rather than simply requesting data, each prompt was designed to generate data with a specific, embedded problem and a clear pedagogical goal. For instance, for a lesson on time-series analysis, a prompt would specify the need for a dataset with temporal features and specific noise patterns. Figure 2 provides a visual example of this process. This approach transformed the educator's role from simply selecting a dataset to "prompt engineering" a dynamic learning scenario that forced students to apply a specific analytical technique and troubleshoot a realistic problem.

Example Prompt for Time-Series Analysis: "As a data scientist, you are analyzing a large dataset of temperature and energy consumption from a smart building. Generate a realistic-looking time-series dataset for a single week, with data points every 15 minutes. The data should show a clear cyclical pattern (daily and weekly) but also include random noise and a few sudden, sharp spikes in energy consumption to simulate an anomaly event like an equipment malfunction. Present the data as a JSON object."
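To make the shape of the requested data concrete, the following is a minimal Python sketch of a dataset that satisfies the prompt above. It is not Junmou's output and not the generation method used in the module; it is a hand-written stand-in that uses only the standard library. The field names (`timestamp`, `temperature_c`, `energy_kwh`), the start date, the amplitudes, and the spike sizes are all illustrative assumptions.

```python
import json
import math
import random
from datetime import datetime, timedelta


def generate_week(seed=42, start=datetime(2025, 1, 6), n_spikes=3):
    """Return one week of 15-minute readings (7 * 24 * 4 = 672 records).

    All numeric constants below are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_points = 7 * 24 * 4
    # Pick a few positions that will carry a sharp anomaly spike.
    spike_positions = set(rng.sample(range(n_points), n_spikes))
    records = []
    for i in range(n_points):
        ts = start + timedelta(minutes=15 * i)
        hour = ts.hour + ts.minute / 60
        # Daily cycle: sine wave that peaks at 15:00 and bottoms out at 03:00.
        daily = math.sin(2 * math.pi * (hour - 9) / 24)
        # Weekly cycle: reduced load on Saturday (5) and Sunday (6).
        weekday_factor = 0.6 if ts.weekday() >= 5 else 1.0
        temperature = 21 + 3 * daily + rng.gauss(0, 0.4)
        energy = 40 + 25 * weekday_factor * (1 + daily) + rng.gauss(0, 2)
        if i in spike_positions:
            # Simulated equipment malfunction.
            energy += rng.uniform(80, 120)
        records.append({
            "timestamp": ts.isoformat(),
            "temperature_c": round(temperature, 2),
            "energy_kwh": round(energy, 2),
        })
    return records


if __name__ == "__main__":
    # Serialise as JSON, as the prompt requests.
    print(json.dumps(generate_week()[:3], indent=2))
```

A fixed seed keeps the dataset reproducible between lab sessions, while changing the seed gives each student group a different instance of the same problem.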
 

Figure 2: Example of Prompt Engineering for Time-Series Data Generation
 

Phase 2: In-Class Execution and Guided Learning.

In the classroom, the generated dataset was presented as a case study. The class first reviewed the Junmou prompt to understand the embedded challenges. This collaborative review let students see a direct link between data generation parameters and practical coding challenges. Students were also encouraged to use Junmou themselves to generate small-scale test datasets for rapid prototyping and iterative problem-solving. This full data flow is represented in Figure 3.
 

Figure 3: Data Flow Diagram for a Single Problem

The use of Junmou to generate data with accompanying metadata was a key element of this methodology. For example, the AI would be asked to "generate a data dictionary explaining each column and its intended purpose." This forced students to practice data exploration and documentation, another crucial skill that is often overlooked. The AI's ability to interpret and explain data patterns was also leveraged. After their analysis was completed, students would use Junmou to ask questions such as, "What kind of analytical model is best suited for this data?" The AI's response would serve as a basis for a class discussion on model selection and validation.
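As a point of comparison for the AI-generated data dictionary, students can derive a basic one directly from the data. The sketch below is one possible way to do so in plain Python; the function names, the coarse type labels, and the sample column names are hypothetical and are not part of the module materials.

```python
from datetime import datetime


def _is_iso_timestamp(value):
    """True when a string parses as an ISO 8601 date or datetime."""
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False


def infer_type(values):
    """Guess a coarse column type from the non-missing values."""
    present = [v for v in values if v is not None]
    if not present:
        return "unknown"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "numeric"
    if all(isinstance(v, str) for v in present):
        return "timestamp" if all(_is_iso_timestamp(v) for v in present) else "text"
    return "mixed"


def build_data_dictionary(records):
    """Summarise each column: inferred type, missing count, one example value."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    dictionary = {}
    for column in columns:
        values = [record.get(column) for record in records]
        present = [v for v in values if v is not None]
        dictionary[column] = {
            "type": infer_type(values),
            "missing": len(values) - len(present),
            "example": present[0] if present else None,
        }
    return dictionary
```

A column reported as "mixed" is a useful prompt for discussion: it usually signals that numbers and text placeholders share one field, which the AI-written description of the column may not mention.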

As shown in Figure 4, Junmou was also used to generate data documentation and assist with model selection, fostering critical skills often overlooked in traditional curricula. The diagram highlights how students would ask the AI to explain data patterns or suggest suitable analytical models, turning the AI's response into a basis for class discussion.
 

Figure 4: Junmou as a Tool for Data Documentation and Model Selection

This methodology created a controlled yet realistic sandbox for big data analytics. It moved beyond passive learning and enabled a truly hands-on, problem-based approach where students were actively engaged in shaping their own data and, by extension, their own learning challenges. The success of this approach lies in its ability to simulate the complexities of industrial data projects within a safe, guided educational environment, as seen in Figure 5, which visually demonstrates the "before and after" impact of the student's work.
 

Figure 5: Data Cleaning Process for a Messy Dataset

This diagram visually demonstrates the impact of data wrangling, showcasing a "Before" and "After" view of a dataset. The "Before" table highlights common real-world data issues such as mixed data types (e.g., "45" as a string in the Price column), missing values (NaN), and inconsistent date formats. The "After" table shows the corrected data, illustrating how a clean, consistent dataset can be achieved through a systematic cleaning process. This transformation is crucial for accurate analysis and modeling.
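The three issues named in Figure 5 can be handled with a short cleaning routine. The sketch below is an assumed example, not the students' code: it uses only the standard library, the list of accepted date layouts is hypothetical, and filling missing prices with the median is one simple choice among many imputation strategies.

```python
import math
from datetime import datetime
from statistics import median

# Hypothetical set of date layouts; extend it to match the dataset at hand.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")


def parse_price(value):
    """Return the price as a float, or None if it is missing or unparseable."""
    try:
        number = float(str(value).strip())
    except ValueError:
        return None
    return None if math.isnan(number) else number


def parse_date(value):
    """Return the date in ISO format (YYYY-MM-DD), or None if no layout fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(str(value).strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean(rows):
    """Normalise types and dates, then fill missing prices with the median."""
    parsed = [
        {"date": parse_date(r.get("date")), "price": parse_price(r.get("price"))}
        for r in rows
    ]
    known = [r["price"] for r in parsed if r["price"] is not None]
    fill = median(known) if known else None
    for r in parsed:
        if r["price"] is None:
            r["price"] = fill
    return parsed
```

In a lab setting the same steps would more likely be written with a data-frame library; the plain-Python version is used here only to keep each step visible.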
 
 
3. Results and Discussion

The integration of XIPU AI (Junmou) into the IOT307TC Big Data Analytics module yielded significant and multifaceted improvements in student learning, engagement, and practical skill development. These results were analyzed through a combination of qualitative feedback from surveys and open-ended comments, and a quantitative assessment of final project outcomes compared to previous cohorts.

3.1 Quantitative Impact on Student Learning Outcomes

The most compelling outcome was the demonstrable improvement in students' ability to handle the "messy" data typical of real-world projects. Once students were required to confront specific, AI-generated challenges, we observed a profound shift in their approach to data analytics.

• Advanced Data Wrangling Proficiency: As shown in Figure 6, students in the Junmou-integrated class demonstrated a more sophisticated understanding of data cleaning and preprocessing. For instance, over 85% of student groups successfully implemented an advanced imputation strategy for handling missing values, a significant increase from the previous average of under 30%. This indicates a deeper conceptual grasp of when and why to apply more complex techniques.
 

Figure 6: Comparative Proficiency in Advanced Data Wrangling Techniques

• Robust Anomaly Detection: The inclusion of intentional outliers in the AI-generated datasets directly prepared students for a critical aspect of IoT data analysis. The projects this year showcased a significantly higher rate of using dedicated anomaly detection algorithms, such as the Isolation Forest or Local Outlier Factor (LOF), rather than just relying on simple statistical thresholds. This hands-on experience with embedded problems fostered a practical intuition for data validation.

• Improved Project Complexity and Innovation: The freedom to generate their own datasets allowed students to explore more ambitious and nuanced project ideas. Projects moved beyond basic exploratory data analysis to include complex machine learning models. For example, one group used a Junmou-generated time-series dataset to build a predictive model for building energy consumption, showcasing a level of application-specific innovation that was less common in previous years.
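To illustrate why a density-based detector such as the Local Outlier Factor mentioned above goes beyond a fixed threshold, the following is a simplified, dependency-free sketch of the LOF calculation. It is an assumed teaching example rather than the students' code: it takes exactly k neighbours even when distances tie, and it is quadratic in the number of points, so it suits only small samples. Student projects would more plausibly rely on an established library implementation.

```python
import math


def local_outlier_factor(points, k=3):
    """Return one LOF score per point; scores well above 1 suggest outliers.

    Simplified sketch: exactly k neighbours are used even when distances tie.
    """
    n = len(points)
    if n <= k:
        raise ValueError("need more points than neighbours")
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbours, k_distance = [], []
    for i in range(n):
        # Sort every other point by its distance to point i.
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(1.0 / max(sum(reach) / k, 1e-12))

    # LOF: average neighbour density divided by the point's own density.
    return [
        sum(lrd[j] for j in neighbours[i]) / k / lrd[i]
        for i in range(n)
    ]
```

Because each score compares a point's density with that of its own neighbours, a reading can be flagged even when its raw value would pass a global threshold, which is the practical intuition the embedded outliers were meant to build.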

3.2 Qualitative Insights and Discussion

Beyond the metrics, the qualitative feedback from students highlighted a transformative shift in their perception of the module and the subject matter itself.

• Increased Engagement and Ownership: The "problem-based" nature of the AI-generated data made the learning process more engaging and personal. As one student commented, "It felt less like a classroom exercise and more like a real data science challenge. The data wasn't perfect, and that made it more interesting to work on." Figure 7 shows students collaborating on a data science project in a computer lab.
 

Figure 7: Students collaborating on a data science project in computer lab D-1002-TC using the AI-driven framework.

• Bridging Theory and Practice: The AI-generated datasets served as a crucial bridge between theoretical concepts and their practical implementation. Students reported that seeing a textbook concept, such as "multivariate time-series data," come to life as a messy, Junmou-generated JSON object solidified their understanding in a way that reading a clean dataset never could.

• Fostering a Culture of Collaboration: In the in-person classroom environment, working on these complex, simulated datasets sparked a more collaborative atmosphere. Students were often seen discussing strategies for handling specific data quirks or debugging code together, with the dataset itself becoming the central point of discussion and joint problem-solving. This peer-to-peer learning was a direct and positive consequence of the AI-driven methodology.

3.3 Challenges and Limitations

A significant challenge was ensuring the AI-generated datasets were both messy and logically consistent. Initial prompts sometimes produced illogical data, which required an iterative refinement process. For instance, a simple sales dataset might have mixed text and numerical values, while a time-series dataset might have nonsensical energy consumption values or inconsistent timestamps. This iterative prompt refinement became a valuable lesson in data validation.
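A lightweight consistency check can make this refinement loop faster by flagging illogical output before a dataset reaches the classroom. The sketch below is a hypothetical example of such a check for the time-series case: the field names, the 15-minute step, and the plausibility ceiling for energy are assumptions, and timestamps are assumed to be timezone-naive ISO strings.

```python
from datetime import datetime, timedelta


def validate_series(records, step=timedelta(minutes=15), max_energy=1000.0):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    previous = None
    for i, record in enumerate(records):
        # Check 1: the timestamp must parse and follow the expected spacing.
        try:
            ts = datetime.fromisoformat(record.get("timestamp"))
        except (TypeError, ValueError):
            ts = None
            problems.append(f"row {i}: unreadable timestamp")
        if ts is not None:
            if previous is not None and ts - previous != step:
                problems.append(f"row {i}: gap of {ts - previous}, expected {step}")
            previous = ts
        # Check 2: energy must be a number inside a plausible range.
        energy = record.get("energy_kwh")
        is_number = isinstance(energy, (int, float)) and not isinstance(energy, bool)
        if not is_number or not 0 <= energy <= max_energy:
            problems.append(f"row {i}: implausible energy value {energy!r}")
    return problems
```

Intentional anomalies such as the spikes requested in a prompt should stay within the plausibility ceiling, so that the check rejects nonsensical values without removing the embedded problem students are meant to find.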

Another limitation was the time required for prompt engineering. Crafting a complex prompt to embed a specific problem demanded significant pedagogical foresight and iterative refinement.

Finally, while synthetic data allowed for the exploration of diverse problems, it does not fully replicate the real-world context of proprietary datasets. Ethical and privacy considerations were discussed conceptually but not practically addressed, a critical area for more advanced modules.
 

4. Conclusion

Our case study within the IOT307TC Big Data Analytics module demonstrates the transformative potential of integrating AI, specifically Junmou, into a pedagogical framework. Shifting away from static, pre-cleaned datasets gave students a more authentic, challenging, and engaging learning experience. This methodology, rooted in prompt engineering, directly addressed the critical skill gap between theoretical data knowledge and the complex, messy reality of industrial data.

The quantitative and qualitative results of this study are compelling. A significant increase was shown in the proficiency of students in the Junmou-integrated cohort with advanced data wrangling techniques, anomaly detection, and the ability to design more innovative and complex projects. The AI-generated, problem-based datasets fostered a sense of ownership and relevance, transforming passive learning into an active, hands-on, and iterative process. Furthermore, a new level of classroom discussion and collaboration was sparked by the use of the AI as a tool for data documentation and model selection.

This experience underscores the value of pedagogical innovation in higher education and provides a powerful example of how AI can be a partner in the classroom. The gap between theory and real-world application has been successfully bridged, preparing students not just to understand data, but to critically engage with its inherent complexities. Colleagues across XJTLU and other higher education institutions are encouraged to explore similar applications of AI, pushing the boundaries of what is possible in educational delivery and empowering the next generation of data professionals.
 

References

[1] "Investigating the data science talent gap: Data practitioners' perspectives," ResearchGate. Accessed: Sep. 22, 2025. [Online]. Available: https://www.researchgate.net/publication/391276552_Investigating_the_data_science_talent_gap_Data_practitioners'_perspectives

[2] "Data Science Education – A Scoping Review," ResearchGate. Accessed: Sep. 22, 2025. [Online]. Available: https://www.researchgate.net/publication/372531547_Data_Science_Education_-_A_Scoping_Review

[3] "Big-Data Skills: Bridging the Data Science Theory-Practice Gap in Higher Education and Industry," PMC. Accessed: Sep. 22, 2025. [Online]. Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC7883353/

[4] "Curriculum, Pedagogy, and Teaching/Learning Strategies in Data Science Education," MDPI. Accessed: Sep. 22, 2025. [Online]. Available: https://www.mdpi.com/2227-7102/15/2/186

[5] "Challenges and Issues in Data Science Education," IEEE Computer Journal. Accessed: Sep. 22, 2025. [Online]. Available: https://idsc.miami.edu/challenges-and-issues-in-datascience-education/

[6] "How Generative AI Is Revolutionizing Training Data with Synthetic Datasets," Dataversity. Accessed: Sep. 22, 2025. [Online]. Available: https://www.dataversity.net/howgenerative-ai-is-revolutionizing-training-data-with-synthetic-datasets/

[7] "A study of the impact of project-based learning on student learning effects: a meta-analysis study," PMC. Accessed: Sep. 22, 2025. [Online]. Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC10411581/

[8] K. Misiejuk, S. López-Pernas, R. Kaliisa, and M. Saqr, "Mapping the Landscape of Generative Artificial Intelligence in Learning Analytics: A Systematic Literature Review," J. Learn. Anal., vol. 12, no. 1, 2025.

[9] B. Santana-Perera, C. García-Barceló, M. González Arcas, and D. Gil, "Exploring Predictive Insights on Student Success Using Explainable Machine Learning: A Synthetic Data Study," Future Internet, vol. 16, no. 9, p. 763, 2025.

 


AUTHOR
Dr. Izhar Oswaldo Escudero Ornelas,
Assistant Professor, BEng, MSc, Ph.D, AFHEA,
School of Internet of Things, Xi'an Jiaotong-Liverpool University (XJTLU)

DATE
17 October 2025
