Research Projects

My research interests include text mining, scholarly big data, applied machine learning and deep learning, natural language processing, and information retrieval.

Applied Machine Learning

Apply machine learning and deep learning to real-world problems.

Natural Language Processing

Extract domain knowledge from textual content in scholarly papers.

Information Retrieval

Build focused search engine systems to index and discover semantic-level information.

Scholarly Big Data

Develop tools to process and present multi-modality scholarly big data at large scale.

CiteSeerX

Utilizing the established infrastructure, this project aims to create a sustainable CiteSeerX system with new data resources and a much larger data collection. We will develop a new system that runs with low operation overhead, and that provides quality and enriched data and metadata in portable formats that will be available through accessible user interfaces. We will ingest all freely accessible scientific documents on the Web, currently estimated to be 30 million. CiteSeerX will make available high-quality metadata through an accessible Web User Interface, Application Programming Interface, and data dumps. This project is supported by the National Science Foundation. Key people include PI Dr. C. Lee Giles (PSU) and Co-PI Dr. Jian Wu.

ODU's role is to co-direct graduate students and postdoctoral scholars in designing essential components of the infrastructure, architecture, data acquisition, extraction, ingestion, cleansing, and indexing.

Mining Electronic Theses and Dissertations

We will bring computational access to book-length documents, through a research and piloting effort employing Electronic Theses and Dissertations (ETDs). The library and archives fields lack research on extracting and analyzing segments of long documents (chapters, reference lists, tables, figures), as well as methods for summarizing individual chapters of longer texts to enable findability. The project brings cutting-edge CS and machine learning technologies to advance discovery, use, and potential for reuse of the knowledge hidden in the text of books and book-length documents. By focusing on libraries' ETD collections, the research will enhance ETD programs, devising effective and efficient methods for opening the knowledge currently hidden in the rich body of graduate research and scholarship. This project is supported by the Institute of Museum and Library Services (IMLS). This project is a joint effort between Virginia Tech and ODU, directed by PI Bill Ingram, Co-PI Dr. Edward A. Fox, and Co-PI Dr. Jian Wu.

The ODU team is currently responsible for extracting metadata and full text from scanned ETDs using OCR techniques and then segmenting the full text into chapters and sections. The team will also extract semantic information such as concepts and their definitions.

Synthetic Prediction Markets with Algorithm Traders for Determining Experimental Reproducibility

This project studies R&R (repeatability and reproducibility) of experiments in academic papers published in social science by researching and developing systems and methods for assigning confidence scores to specific findings published in the social science literature. The final products include a prototypical instantiation of the proposed system that functions within the CiteSeerX framework and maintains explainability of its assertions. This project is supported by the Defense Advanced Research Projects Agency (DARPA). It is a collaborative effort of Pennsylvania State University, Texas A&M University, Microsoft Research, and ODU. Key people include Dr. C. Lee Giles (PSU), Dr. Sarah Rajtmajer (PSU), Dr. Chris Griffin (PSU), Dr. Anna Squicciarini (PSU), James Caverlee (TAMU), Xia (Ben) Hu (TAMU), Dr. Frank Shipman (TAMU), and David Pennock (Microsoft).

The ODU team works with the PSU team to extract information from scholarly papers, including but not limited to headers, citations, acknowledgements, domain knowledge entities, and math expressions, and to integrate the extractors into the PDFMEF framework, which will be part of the final system.

Topological Relation-Based Image Analysis using Graphs

The goal of this project is to automate the understanding of technical content contained in scientific images. In particular, the goal is to track the spread of technical information by finding copies and modified copies of technical diagrams in patent databases, as well as to label electronic components within tomography images. These two applications share the property that shape and topology within the image are the most important features. Computer vision, especially through the use of machine learning methods, has dramatically improved the ability to detect objects in images and to semantically segment images to automate the labelling of pixels within an image. However, these advances have not yet automated the understanding of information contained in hand-drawn figures, technical diagrams, and imagery produced for scientific inquiry. The key innovation is the insight that these technical images carry little per-pixel information compared with natural images (photographs and video), and that context, topology, and shape provide the information instead. By representing images as hierarchical graphs, with annotations on topological relationships, the project will model the context and knowledge necessary to perform semantic-level analysis of images. Key people include Dr. Diane Oyen (LANL), Dr. C. Lee Giles (PSU), and Dr. Jian Wu (ODU).

Currently, the ODU team is investigating state-of-the-art techniques for image retrieval, focusing on image meta search (text-based search), and probing the feasibility of applying them to technical images and diagrams (as opposed to natural images). This project is supported by the Department of Energy (DoE) through the Los Alamos National Laboratory (LANL).

Publications

I have published 60+ peer-reviewed papers in ACM, IEEE, and AAAI conferences, magazines, and journals, in addition to earlier publications in astronomical journals.
  1. Jian Wu, Shaurya Rohatgi, Sai Raghav Reddy Keesara, Jason Chhay, Kevin Kuo, Arjun Manoj Menon, Sean Parsons, Bhuvan Urgaonkar, C. Lee Giles. Building an Accessible, Usable, Scalable, and Sustainable Service for Scholarly Big Data. In the 2021 IEEE International Conference on Big Data, (IEEE 2021 Big Data). Virtual Event. [bibtex]
  2. Sarah Rajtmajer, Christopher Griffin, Jian Wu, Robert Fraleigh, Laxmaan Balaji, Anna Squicciarini, Anthony Kwasnica, David Pennock, Michael McLaughlin, Timothy Fritton, Nishanth Nakshatri, Arjun Menon, Sai Ajay Modukuri, Rajal Nivargi, Xin Wei, C. Lee Giles. A Synthetic Prediction Market for Estimating Confidence in Published Work. In the 36th AAAI Conference on Artificial Intelligence, (AAAI 2022 Demo). Virtual Event.
  3. Chinmayee Rane, Seshasayee Mahadevan Subramanya, Devi Sandeep Endluri, C. Lee Giles, Jian Wu. ChartReader: Automatic Parsing of Bar-Plots. In the IEEE 22nd International Conference on Information Reuse and Integration for Data Science, (IRI 2021). Virtual Event. [bibtex]
  4. Athar Sefid, Jian Wu, Prasenjit Mitra, C. Lee Giles. Extractive Research Slides Generation Using Windowed Labeling Ranking. In the 2nd International Workshop on Scientific Document Processing, (SDP @ NAACL 2021). Virtual Event. [bibtex]
  5. Shaurya Rohatgi, Jian Wu, and C. Lee Giles. What Were People Searching For? A Query Log Analysis of An Academic Search Engine. In the 2021 ACM/IEEE Joint Conference on Digital Libraries, (JCDL 2021). Virtual Event. [bibtex]
  6. Muntabir Hasan Choudhury, Himarsha R. Jayanetti, Jian Wu, William A. Ingram, Edward A. Fox. Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations. In the 2021 ACM/IEEE Joint Conference on Digital Libraries, (JCDL 2021). Virtual Event. [bibtex]
  7. Sampanna Yashwant Kahu, William A. Ingram, Edward A. Fox, Jian Wu. ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations. In: the 2021 ACM/IEEE Joint Conference on Digital Libraries, (JCDL 2021). Virtual Event. [bibtex]
  8. Meng Ling, Torsten Moller, Petra Isenberg, Tobias Isenberg, Michael Sedlmair, Robert Laramee, Han-Wei Shen, Jian Wu, C. Lee Giles. Document Domain Randomization for Deep Learning Document Layout Extraction. In: the 2021 International Conference on Document Analysis and Recognition, (ICDAR 2021). Lausanne, Switzerland. [bibtex]
  9. Sree Sai Teja Lanka, Sarah Michele Rajtmajer, Jian Wu, C. Lee Giles. Extraction and Evaluation of Statistical Information from Social and Behavioral Science Papers. In: the 1st International Workshop on Scientific Knowledge, (Sci-K @ Web Conference 2021). Virtual Event. [bibtex]
  10. Wei Wang, Feng Xia, Jian Wu, Zhiguo Gong, Hanghang Tong, Brian D. Davison. Scholar2vec: Vector Representation of Scholars for Lifetime Collaborator Prediction. In: Transactions on Knowledge Discovery from Data, (TKDD). [bibtex]
  11. Sai Ajay Modukuri, Sarah Rajtmajer, Anna Cinzia Squicciarini, Jian Wu, C. Lee Giles. Understanding and Predicting Retractions of Published Work. In: The 1st AAAI Workshop on Scientific Document Understanding @ AAAI 2021, (SDU @ AAAI 2021), Virtual Event. [bibtex]
  12. Ming Gong, Xin Wei, Diane Oyen, Jian Wu, Martin Gryder, Liping Yang. Recognizing Figure Labels in Patents. In: The 1st AAAI Workshop on Scientific Document Understanding @ AAAI 2021, (SDU @ AAAI 2021), Virtual Event. [bibtex]
  13. Yasith Jayawardana, Alexander C. Nwala, Gavindya Jayawardena, Jian Wu, Sampath Jayarathna, Michael L. Nelson, C. Lee Giles. Modeling Updates of Scholarly Webpages Using Archived Data. In: Proceedings of the Computational Archival Science Workshop @ BigData 2020, (CAS @ BigData 2020), Virtual Event. [bibtex]
  14. Bharath K. Kandimalla, Shaurya Rohatgi, Jian Wu, and C. Lee Giles. Large Scale Subject Category Classification of Scholarly Papers with Deep Attentive Neural Networks. In: Frontiers in Research Metrics and Analytics, section Text-mining and Literature-based Discovery, (Front. Res. Metr. Anal.), 2020, doi:10.3389/frma.2020.600382 [bibtex]
  15. Jian Wu, Pei Wang, Xin Wei, Sarah Rajtmajer, C Lee Giles, and Christopher Griffin. Acknowledgement Entity Recognition in CORD-19 Papers. In: Proceedings of the 1st Workshop on Scholarly Document Processing @ EMNLP 2020, (SDP @ EMNLP 2020), November 19, 2020. Virtual Event. [bibtex]
  16. Shaurya Rohatgi, Zeba Karishma, Jason Chhay, Sai Raghav Reddy Keesara, Jian Wu, Cornelia Caragea, and C Lee Giles. COVIDSeer: Extending the CORD-19 Dataset. In: Proceedings of the 20th ACM Symposium on Document Engineering, (DocEng 2020), September 29–October 2, 2020. Virtual Event. [bibtex]
  17. Shaurya Rohatgi, Jian Wu, and C. Lee Giles. PSU at CLEF-2020 ARQMath Track: Unsupervised Re-ranking using Pretraining. In: 11th International Conference of the Conference and Labs of the Evaluation Forum (CLEF) Association (CLEF 2020), September 22-25, 2020, Virtual Event.
  18. Chenrui Guo, Haoran Cui, Li Zhang, Jiamin Wang, Wei Lu, and Jian Wu. SmartCiteCon: Implicit Citation Context Extraction from Academic Literature Using Supervised Learning. In: Proceedings of the 8th International Workshop on Mining Scientific Publications (WOSP @ JCDL 2020), August 5, 2020, Virtual Event.
  19. Muntabir Hasan Choudhury, Jian Wu, William A. Ingram, Edward A. Fox. A Heuristic Baseline Method for Metadata Extraction from Scanned Electronic Theses and Dissertations. Poster in: Proceedings of the 2020 International Conference on Digital Libraries (JCDL 2020), August 1–5, 2020. Virtual Event. [Best Poster Honorable Mention] [bibtex]
  20. Jian Wu, Md Reshad Ul Hoque, Gunnar W. Reiske, Michele C. Weigle, Brenda T. Bradshaw, Holly D. Gaff, Jiang Li, and Chiman Kwan. A Comparative Study of Sequential Tagging Methods for Domain Knowledge Entity Recognition in Biomedical Papers. In: Proceedings of the 2020 Joint Conference on Digital Libraries (JCDL 2020), August 1–5, 2020, Virtual Event, China. [bibtex]
  21. Jian Wu and C. Lee Giles. Scholarly Very Large Data: Challenges For Digital Libraries. In: Large Scale Networking Workshop on Huge Data, April 12–13, 2020, Chicago, IL, USA.
  22. Ruijing Yao, Linlin Hou, Yingchun Ye, Ou Wu, Ji Zhang, and Jian Wu. Method and Dataset Mining in Scientific Papers. Poster in: 2019 IEEE International Conference on Big Data (Bigdata 2019), December 9–12, 2019, Los Angeles, CA, USA. [bibtex]
  23. Athar Sefid, Jian Wu, Prasenjit Mitra, and C. Lee Giles. 2019. Automatic Slide Generation for Scientific Papers. In the 3rd International Workshop on Capturing Scientific Knowledge, Los Angeles, California, USA, November 19th, 2019. (SciKnow @ K-CAP 2019). 11–16. [bibtex]
  24. Md Reshad Ul Hoque, Jian Wu, Jiang Li, Chiman Kwan, Agnese Chiatti, and Dash Bradley. Searching for Evidence of Scientific News in Scholarly Big Data. In: Proceedings of the 10th International Conference on Knowledge Capture (K-CAP 2019), November 19-21, 2019, Marina del Rey, CA, USA. [bibtex]
  25. Jian Wu, Kunho Kim, and C. Lee Giles. CiteSeerX: 20-year Service for Scholarly Big Data. In: Proceedings of the 2019 Artificial Intelligence for Data Discovery and Reuse (AIDR 2019), May 13-15, 2019, Pittsburgh, PA, USA. [bibtex]
  26. Nir Nissim, Aviad Cohen, Jian Wu, Andrea Lanzi, Lior Rokach, Yuval Elovici, and C. Lee Giles. Sec-Lib: Protecting Scholarly Libraries from Infected Papers Using Active Machine Learning Framework. In: IEEE Access. 2019. [bibtex]
  27. Behrooz Mansouri, Shaurya Rohatgi, Douglas Oard, Jian Wu, C. Lee Giles, and Richard Zanibbi. Tangent-CFT: an Embedding model for Mathematical Formulas. In: Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval (ICTIR 2019), October 2-5, 2019, Santa Clara, CA, USA. [bibtex]
  28. Alexander G. Ororbia, Ankur Mali, Jian Wu, Scott O'Connell, William Dreese, David Miller, and C. Lee Giles. Learned Neural Iterative Decoding for Lossy Image Compression Systems. In: Proceedings of the 2019 Data Compression Conference (DCC 2019), March 26-29, 2019, Snowbird, UT, USA. [bibtex]
  29. Athar Sefid, Jian Wu, Allen C. Ge, Jing Zhao, Lu Liu, Cornelia Caragea, Prasenjit Mitra, C. Lee Giles. Cleaning Noisy and Heterogeneous Metadata for Record Linking Across Scholarly Big Datasets. In: Proceedings of the 31st Innovative Applications of Artificial Intelligence Conference (IAAI 2019), January 29-31, 2019, Honolulu, Hawaii, USA. [bibtex]
  30. Jian Wu, Bharath Kandimalla, Shaurya Rohatgi, Athar Sefid, Jianyu Mao, C. Lee Giles. CiteSeerX-2018: A Cleansed Multidisciplinary Scholarly Big Dataset. In: Proceedings of the 2018 IEEE International Conference on Big Data (BigData 2018), December 10-13, 2018, Seattle, WA, USA. [bibtex]
  31. Jian Wu, Athar Sefid, Allen C. Ge, and C. Lee Giles. A Supervised Learning Approach To Entity Matching Between Scholarly Big Datasets. In: Proceedings of the 9th International Conference on Knowledge Capture (K-CAP 2017), December 4-6, 2017, Austin, Texas, USA. [bibtex]
  32. Hung-Hsuan Chen, Jian Wu, C. Lee Giles. Compiling Keyphrase Candidates for Scientific Literature Based on Wikipedia. In: (meta)-data quality workshop part of the 21st International Conference on Theory and Practice of Digital Libraries (MDQual 2017), September 18-21, 2017, Thessaloniki, Greece. [bibtex]
  33. Yufeng Ma, Tingting Jiang, Chandani Shrestha, Edward A. Fox, Jian Wu, and C. Lee Giles. Scenarios for Advanced Services in an ETD Digital Library. In: the 20th international symposium on electronic theses and dissertations (ETD2017), August 7-9, 2017, Washington, DC, USA. [bibtex]
  34. Jian Wu, Sagnik Ray Choudhury, Agnese Chiatti, Chen Liang, and C. Lee Giles. HESDK: A Hybrid Approach to Extracting Scientific Domain Knowledge Entities. In: Proceedings of ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2017), June 19-23, 2017, Toronto, Canada. [bibtex]
  35. Nir Nissim, Aviad Cohen, Jian Wu, Andrea Lanzi, Lior Rokach, Yuval Elovici, and Lee Giles. Scholarly Digital Libraries as a Platform for Malware Distribution. In: Proceedings of the 2nd Singapore Cyber Security R&D Conference (SG-CRC 2017), Singapore. [bibtex]
  36. Jian Wu, Chen Liang, Huaiyu Yang, and C. Lee Giles. CiteSeerX data: semanticizing scholarly papers. In: Proceedings of the International Workshop on Semantic Big Data (SBD @ SIGMOD 2016), June 26-30, 2016, San Francisco, CA, USA. [bibtex]
  37. Kyle Williams, Jian Wu, Zhaohui Wu, and C. Lee Giles. Information Extraction for Scholarly Digital Libraries. In: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries (JCDL 2016), Newark, NJ, USA. [bibtex]
  38. Cornelia Caragea, Jian Wu, Sujatha Das G., and C. Lee Giles. Document Type Classification in Online Digital Libraries. In: Proceedings of the 28th Innovative Applications of Artificial Intelligence Conference (IAAI 2016), February 12-17, 2016, Phoenix, AZ, USA. [bibtex]
  39. Jian Wu, Kyle Williams, Hung-Hsuan Chen, Madian Khabsa, Cornelia Caragea, Suppawong Tuarob, Alexander Ororbia, Douglas Jordan, Prasenjit Mitra, and C. Lee Giles. CiteSeerX: AI in a Digital Library Search Engine. Artificial Intelligence Magazine (AI Magazine), 2015. [bibtex]
  40. Jian Wu and C. Lee Giles. Information Extraction for Scholarly Document Big Data. In: the 1st International Workshop on Capturing scientific knowledge (SciKnow 2015), October 7-10, 2015, Palisades, NY. [bibtex]
  41. Jian Wu, Jason Killian, Huaiyu Yang, Kyle Williams, Sagnik Ray Choudhury, Suppawong Tuarob, Cornelia Caragea, and C. Lee Giles. PDFMEF: A Multi-Entity Knowledge Extraction Framework for Scholarly Documents and Semantic Search. In: Proceedings of the 8th International Conference on Knowledge Capture (K-CAP 2015), October 7-10, 2015, Palisades, NY, USA. [Best Paper Nomination] [bibtex]
  42. Alexander G. Ororbia II, David Reitter, Jian Wu, and C. Lee Giles. Online Learning of Deep Hybrid Architectures for Semi-Supervised Categorization. In: Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2015), September 7-11, 2015, Porto, Portugal. [bibtex]
  43. Alexander G. Ororbia II, Jian Wu, and C. Lee Giles. Big Scholarly Data in CiteSeerX: Information Extraction from the Web. In: the 2nd WWW Workshop on Big Scholarly Data: Towards the Web of Scholars (BigScholar 2015), Florence, Italy. [bibtex]
  44. Alexander G. Ororbia II, Jian Wu, and C. Lee Giles. CiteSeerX: Intelligent Information Extraction and Knowledge Creation from Web-Based Data. In: the 4th Workshop on Automated Knowledge Base Construction at NIPS 2014, (AKBC 2014), December 13, 2014, Montréal, Canada. [bibtex]
  45. Jian Wu, Alexander G. Ororbia II, Kyle Williams, Madian Khabsa, Zhaohui Wu and C. Lee Giles. Utility Based Control Feedback in A Digital Library Search Engine: Cases in CiteSeerX. In: The 9th USENIX International Workshop on Feedback Computing (Feedback 2014), Philadelphia, PA, USA. [bibtex]
  46. Jian Wu, Kyle Williams, Madian Khabsa and C. Lee Giles. The Impact of User Corrections on A Crawl-Based Digital Library: A CiteSeerX Perspective. In: The 10th International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom 2014), Miami, Florida, USA. [Invited Paper] [bibtex]
  47. Kyle Williams, Jian Wu, and C. Lee Giles. SimSeerX: A Similar Document Search Engine. In: the 14th ACM Symposium on Document Engineering (DocEng 2014), September 16-19 2014, Fort Collins, CO, USA. [bibtex]
  48. Jian Wu, Kyle Williams, Hung-Hsuan Chen, Madian Khabsa, Cornelia Caragea, Alexander G. Ororbia II, Douglas Jordan, and C. Lee Giles. CiteSeerX: AI in a Digital Library Search Engine. In: the 26th Annual Conference on Innovative Applications of Artificial Intelligence (IAAI 2014), July 29-31, 2014, Québec City, Québec, Canada. [Best Application Paper] [bibtex]
  49. Kyle Williams, Jian Wu, Sagnik Choudhury, Madian Khabsa, and C. Lee Giles. Scholarly Big Data Information Extraction and Integration in the CiteSeerX Digital Library. In: the 10th International Workshop on Information Integration on the Web (IIWeb 2014), March 31–April 4, 2014, Chicago, IL, USA. [bibtex]
  50. Zhaohui Wu, Jian Wu, Madian Khabsa, Kyle Williams, Hung-Hsuan Chen, Wenyi Huang, Suppawong Tuarob, Sagnik Ray Choudhury, Alexander G. Ororbia II, Prasenjit Mitra, and C. Lee Giles. Towards Building a Scholarly Big Data Platform: Challenges, Lessons and Opportunities. In: the Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries (DL 2014), London, UK. [bibtex]
  51. Kyle Williams, Lichi Li, Madian Khabsa, Jian Wu, Patrick Shih, and C. Lee Giles. A Web Service for Scholarly Big Data Information Extraction. In: the 21st IEEE International Conference on Web Services (ICWS 2014), Anchorage, AK, USA. [bibtex]
  52. Jian Wu, Pradeep Teregowda, Kyle Williams, Madian Khabsa, Douglas Jordan, Eric Treece, Zhaohui Wu and C. Lee Giles. Migrating A Digital Library to A Private Cloud. In: Proceedings of the IEEE International Conference on Cloud Engineering 2014 (IC2E 2014), Boston, MA, USA. [Best Paper Nomination] [bibtex]
  53. Cornelia Caragea, Jian Wu, Alina Ciobanu, Kyle Williams, Juan Fernández-Ramírez, Hung-Hsuan Chen, Zhaohui Wu, and C. Lee Giles. CiteSeerX: A Scholarly Big Dataset. In: Proceedings of the 36th European Conference on Information Retrieval (ECIR 2014), Amsterdam, Netherlands. [bibtex]
  54. Cornelia Caragea, Jian Wu, Kyle Williams, Sujatha Das G., Madian Khabsa, Pradeep Teregowda and C. Lee Giles. Automatic Identification of Research Articles from Crawled Documents. In: the 2014 Workshop on Web-scale Classification: Classifying Big Data from the Web (WSCBD 2014), New York City, NY, USA. [bibtex]
  55. Jian Wu, Pradeep Teregowda, Eric Treece, Madian Khabsa, Douglas Jordan, Stephen Carman, Prasenjit Mitra and C. Lee Giles. 2013. Scalability Bottlenecks of the CiteSeerX Digital Library Search Engine. In: the 10th International Workshop on Large-Scale and Distributed System for Information Retrieval (LSDS-IR 2013), Rome, Italy. [bibtex]
  56. Jian Wu, Pradeep Teregowda, Madian Khabsa, Stephen Carman, Douglas Jordan, Jose San Pedro Wandelmer, Xin Lu, Prasenjit Mitra, and C. Lee Giles. Web crawler middleware for search engine digital libraries: a case study for citeseerX. In: the Proceedings of the 12th international workshop on Web information and data management, (WIDM 2012), Maui, HI, USA. [bibtex]
  57. Jian Wu, Pradeep Teregowda, Juan Pablo Fernández-Ramírez, Prasenjit Mitra, Shuyi Zheng, and C. Lee Giles. The evolution of a crawling strategy for an academic document search engine: whitelists and blacklists. In: Proceedings of the 3rd Annual ACM Web Science Conference (WebSci 2012), Evanston, IL, USA. [bibtex]
  58. Sumit Bhatia, Cornelia Caragea, Hung-Hsuan Chen, Jian Wu, Pucktada Treeratpituk, Zhaohui Wu, Madian Khabsa, Prasenjit Mitra and C. Lee Giles. Specialized Research Datasets in the CiteSeerX Digital Library. D-Lib Magazine, Vol.18, Number 7/8, 2012. [bibtex]

Publications in Astronomy & Astrophysics

  1. Daniel E. Vanden Berk, Sarah C. Wesolowski, Mary J. Yeckley, Joseph M. Marcinik, Jean M. Quashnock, Lawrence M. Machia, and Jian Wu. Extreme ultraviolet quasar colours from GALEX observations of SDSS DR14Q catalog. In: Monthly Notices of the Royal Astronomical Society (MNRAS), 2020. [accepted]
  2. B. Luo, W.N. Brandt, M. Eracleous, Jian Wu, P.B. Hall, A. Rafiee, D.P. Schneider, Jianfeng Wu. X-ray and multiwavelength insights into the inner structure of high-luminosity disc-like emitters. In: Monthly Notices of the Royal Astronomical Society (MNRAS), 2013, 429, 1479. [bibtex]
  3. Jian Wu, Daniel Vanden Berk, Dirk Grupe, Scott Koch, Jonathan Gelbord, Donald P. Schneider, Caryl Gronwall, Sarah Wesolowski, and Blair L. Porterfield. A Quasar Catalog with Simultaneous UV, Optical and X-ray Observations by Swift. In: Astrophysical Journal Supplement (ApJS), 2012, 201:10. [bibtex]
  4. Jian Wu, Jane C. Charlton, Toru Misawa, Michael Eracleous, and Rajib Ganguly. The Physical Conditions of the Intrinsic NV Narrow Absorption Line Systems of Three Quasars. In: The Astrophysical Journal (ApJ), 2010, 722:997. [bibtex]
  5. Jian Wu, D.E. Vanden Berk, W.N. Brandt, D.P. Schneider, R.R. Gibson, and Jianfeng Wu. Probing Origins of the C IV λ1549 and Fe Kα Baldwin Effect. In: The Astrophysical Journal (ApJ), 2009, 702:767. [bibtex]
  6. Donald P. Schneider, Patrick B. Hall, Gordon T. Richards, Michael A. Strauss, Daniel E. Vanden Berk, Scott F. Anderson, W. N. Brandt, Xiaohui Fan, Sebastian Jester, Jim Gray, James E. Gunn, Mark U. SubbaRao, Anirudda R. Thakar, Chris Stoughton, Alexander S. Szalay, Brian Yanny, Donald G. York, Neta A. Bahcall, J. Barentine, Michael R. Blanton, Howard Brewington, J. Brinkmann, Robert J. Brunner, Francisco J. Castander, István Csábai, Joshua A. Frieman, Masataka Fukugita, Michael Harvanek, David W. Hogg, Zeljko Ivezic, Stephen M. Kent, S. J. Kleinman, G. R. Knapp, Richard G. Kron, Jurek Krzesinski, Daniel C. Long, Robert H. Lupton, Atsuko Nitta, Jeffrey R. Pier, David H. Saxe, Yue Shen, Stephanie A. Snedden, David H. Weinberg, and Jian Wu. The Sloan Digital Sky Survey Quasar Catalog. IV. Fifth Data Release. In: The Astronomical Journal (AJ), 2007, 134:102. [bibtex]
  7. Yu Lu, Tinggui Wang, Hongyan Zhou, Jian Wu. On the Selection Effect of Radio Quasars in the Sloan Digital Sky Survey. In: The Astronomical Journal (AJ), 2007, 133:1615. [bibtex]

Teaching

I have taught or co-taught at least six different courses since 2014, including both undergraduate and graduate courses. I believe that a teacher should be a director and an oracle of a class rather than merely an instructor. The teaching process should put students at the center. Teachers should motivate students (internally or externally) and let them explore and exploit instead of waiting for the next lecture. A qualified teacher should be ready to answer any question students have about the class. The synopses of the courses I have taught are outlined below.

CS418/518: Web Programming

This class will introduce the process of writing interactive web applications accessible through the WWW. Students will develop in the LAMP environment with Elasticsearch as the search platform. Emphasis will be on the integration of these components for a useful application, a search engine based on either semistructured or unstructured documents. Lectures will provide an overview of various concepts, and the class will be centered around the development of a semester-long project. Prerequisites include Web familiarity, programming knowledge, and database and search experience. The course will give best-practice instruction and guidance in developing a website with search functions using a LAMP stack, HTML, JavaScript, PHP, and MySQL, along with other more modern technologies, languages, and systems. The course will require students to use Git for version control and project submission via GitHub.

This course was offered by me at ODU CS in Fall 2021 (syllabus), Spring 2021, Fall 2020, and Fall 2019.

CS450/550: Database Concepts

This course aims to prepare Computer Science and Cybersecurity students for obtaining a fundamental understanding of the relational database concepts and practical skills to analyze and implement a well-defined database design. In particular, CS450/550 provides an introduction to physical database design, data modeling, relational model, logical database design, SQL query language, and instructors’ choices on database applications and advanced concepts. Students will learn to use a real-world open-source database management system. Upon taking CS450/550, students should be able to understand the implications and future directions of databases and database technologies.

This course is offered by me at ODU CS in Spring 2022 (syllabus) and Spring 2021.

CS734/834: Introduction to Information Retrieval

This class will explore the theory and engineering of information retrieval in the context of developing web-based search engines. The course will explore topics related to crawling, ranking, query processing, retrieval models, evaluation, clustering, and other aspects related to building search engines. The course will also cover recently established ranking algorithms that incorporate semantic similarities, machine learning, and neural network methods, such as learning to rank and neural information retrieval. The class will feature several hands-on development and coding exercises using tools such as Google Custom Search and Elasticsearch, as well as a theoretical exploration of the existing literature on these topics. An external speaker will also be invited to give a talk on contemporary search engines and related topics (depending on availability). Students must be comfortable with self-directed learning appropriate for an advanced graduate class.

This course was offered by me at ODU CS in Fall 2021 (syllabus) and Spring 2019.

CS795/895: Mining Scholarly Big Data

Natural language processing (NLP) is one of the computer science subject areas most impacted by artificial intelligence in the last decade. This technology has in turn advanced the ability of machines to read, understand, and write textual content.

Half of this seminar uses textual content in scientific documents as an example to train graduate students in effective and efficient ways to process text and extract statistical, syntactic, and semantic features from free text. The other half covers contemporary research topics in scholarly big data, an instance of big data, and more broadly in text mining. The course will introduce commonly used machine learning (ML), NLP, and information retrieval (IR) tools as preparation for a course project.

This course was offered by me at ODU CS in Fall 2020 (syllabus) and Fall 2018.

CS795/895: Deep Learning for Natural Language Understanding

Over the past two decades, with the advent and prevalence of GPUs and the more recently adopted TPUs, deep learning has made revolutionary advances, achieving remarkable progress on state-of-the-art tasks in traditional natural language processing (NLP) and computer vision (CV). Against this background, a new field called natural language understanding (NLU) has emerged and received much attention from both academic and industrial researchers. The core task of NLU is to tackle fundamental challenges in training and testing computer algorithms that effectively and efficiently represent human language with data structures that computers can process, and to build artificial intelligence (AI) systems that mimic the human ability to interpret and generate language.

The subject covers many emerging research topics. Some have made substantial progress over the past decade (such as building pre-trained language models) and some are still challenging (such as automatically generating coherent abstract summaries). This topical course is designed for graduate students to learn fundamental concepts and algorithms of deep learning and to explore important research topics in NLU, including contextual representation models, grounded language understanding, natural language inference, supervised sentiment analysis, neural information retrieval, relation extraction with distant supervision, and semantic parsing. The course will also introduce representative benchmark datasets and evaluation metrics.

This course is offered by me at ODU CS in Spring 2022 (syllabus).

IST210: Organization of Data

As database management software becomes one of the critical components in modern IT applications and systems, a solid understanding of the fundamentals of data design and management is required for virtually any IT professional. In a business setting, such IT professionals should be able to talk to clients to derive the right requirements for database applications, ask the right questions about the nature of the entities in their business scenarios and the relationships among them, analyze and develop an effective and robust design to address business constraints, and adapt existing database designs as new needs arise. A solid understanding of the underlying data models and design issues in data applications is also critical for SRA (Security and Risk Analysis) and cybersecurity students to ensure secure access to and intelligent analysis of data in complex business settings. Modern IT professionals should be able to guide a company in the best use of the diverse database-related technologies and applications for the “Big Data” era.

As such, IST 210 aims to give students a fundamental understanding of the concepts and practical skills needed to analyze and implement a well-defined relational database design. In particular, IST 210 provides an introduction to physical database design, data modeling, the relational model, logical database design, the SQL query language, and the instructor's choice of database applications and advanced concepts. Students will also learn to use a real-world commercial or open-source database management system. Upon completing IST 210, students should be able to understand the implications and future directions of databases and database technologies.

This course was offered by me at Penn State IST in Spring 2017, Fall 2017, and Spring 2018.

IST140: Introduction to Application Development

This is a first course in programming principles for application development. The course will focus on application development foundations including: fundamental programming concepts; basic data types and data structures; problem solving using programming; basic testing and debugging; basic computer organization and architecture; and fundamentals of operating systems. This is a hands-on course designed to help students learn to program a practical application using modern, high-level languages.

This course was offered by me at Penn State IST in Fall 2017 and Spring 2018.

IST441: Information Retrieval and Search Engines

This course is intended to prepare students to understand, design, develop and use information retrieval and search systems. The course will cover: organization, representation, and access to information; categorization, indexing, and content analysis; data structures for unstructured data; design and maintenance of such data structures, indexing and indexes, retrieval and classification schemes; use of codes, formats, and standards; analysis, construction and evaluation of search and navigation techniques; and search engines and how they relate to the above. Students will build a specialty web search engine using open source web tools and focused web crawling.

I co-taught this course with Dr. C. Lee Giles at Penn State IST in Spring 2015, Spring 2016, and Spring 2017. The official course page of IST441 is here.

The LAMP-SYS Lab

Since the beginning of the 21st century, computer and information science has witnessed rapid and unprecedented advances in artificial intelligence (AI), represented by the flourishing of machine learning and deep learning algorithms. However, many of these algorithms and models are limited to lab experiments. In real-world problems, data are often noisy, contaminated, and deficient. Usually, a single model is not sufficient to meet specific requirements, which calls for systems consisting of multiple components.

The mission of the lab is to apply machine learning and deep learning techniques to real-world problems, focusing on building systems that solve multidisciplinary problems using building blocks from natural language processing, scholarly big data, digital libraries, and information retrieval.

The lab logo is a lighthouse drawn by a child. We are viewing the world with our naked eyes like little children, attempting to represent it using sketchy strokes and simple colors.

Lab Logo
Courtesy: Joseph Wu

Current Lab Members

Muntabir Choudhury

Muntabir started working with Dr. Wu in the fall of 2019. He obtained his Bachelor's degree from Elizabethtown College in Pennsylvania and then worked as an engineer at Resource9 Inc. in New York City. Muntabir is pursuing a PhD degree and is a graduate research assistant. He works on a collaborative project with Virginia Tech to mine information from scanned electronic theses and dissertations (ETDs).

Xin Wei

Xin is a PhD student in computer science. She started working with Dr. Wu in summer 2020. Previously, she worked with Dr. Cong Wang on cybersecurity. Xin's research focuses on extracting semantic information from scientific papers. She has participated in and then led the information extraction effort for the SCORE project and the semantic information extraction from US design patents.

Kehinde Ajayi

Kehinde (Kenny) started his PhD with Dr. Wu in Spring 2021. His research focuses on accurately extracting data from scientific tables. Kenny participated in the project to build a large-scale patent image dataset. He interned at Microsoft as a Data Scientist in summer 2022.

Lamia Salsabil

Lamia started her PhD in Spring 2021. Her research focuses on investigating the computational reproducibility of research papers. Lamia also participated in a project to improve the metadata quality of electronic theses and dissertations.

Previous Lab Members

Pei Wang

Pei started working with Dr. Wu in the spring of 2020. He obtained his Bachelor's degree from Beihang University in Beijing, China and then worked as a commentator for VSPN. He worked with Dr. Wu as a graduate research assistant for one year on acknowledgement extraction and then transferred to Virginia Tech. After obtaining a master's degree, he was hired by Microsoft as a Data Scientist.

Recruitment

I am actively recruiting undergraduate and graduate students to join my lab. Below are the open positions.

Recruiting Students (Undergraduate or Graduate) for Paid Tasks on Image Annotation (Training Provided, No Major Requirement)

Task Description:

Dr. Jian Wu, assistant professor of Computer Science, in collaboration with Dr. Diane Oyen at the Los Alamos National Laboratory (LANL), is recruiting 8-10 undergraduate or graduate students for paid tasks to annotate a set of figure images extracted from patent documents. The goal of this project is to build a human-labeled dataset to train computer algorithms to automatically annotate figure images. The annotations include labeling figure types, identifying text labels, and extracting specific text spans from figure captions. The annotations can be done online using a web portal that has been developed for this task. Participants will receive a short training course (30 min) before they can start working.

Requirements:

  1. An undergraduate or a graduate student of any major.
  2. A cumulative GPA of 2.5 or above.
  3. Having access to a personal computer or a lab computer.
  4. Being able to see images and read text on computer monitors.
  5. Understanding the basics of how to use search engines and web-based applications.
  6. Being able to commit 5-10 hrs/week on this task for 3 weeks (tentatively starting from March 21st, 2022).
  7. Being eligible to work at ODU.

Compensation:

Participants will be compensated at a rate of $10/hour. The payment will be processed through Old Dominion University Research Foundation.

Tentative Schedule:

Participants will receive online training between March 14 and March 18, 2022. After that, the annotation will start on March 21st, 2022 and last for 3-4 weeks.

Application:

To apply for this task, please fill out the Google form. The application will close when the desired number is reached. For any questions, please contact Dr. Jian Wu (j1wu@odu.edu).

Undergraduate students for Research in Data Science

The LAMP-SYS lab at ODU is recruiting motivated undergraduate students enrolled in the Computer Science program to participate in research projects about CiteSeerX under the mentoring of Dr. Jian Wu. CiteSeerX is a digital library search engine providing over 10 million academic documents online. One project (PDFMEF) will develop scalable and customizable information extraction software that can process millions of PDF documents in a timely manner. The other project (Online Voting) involves frontend design to facilitate the evaluation of multiple keyphrase extraction models with crowdsourcing. Either project will last for the summer and the fall semesters.

Basic requirements:

  1. CS, ECE or related majors. Senior students in the Linked Program are welcome to apply.
  2. A cumulative GPA of 3.25 or above.
  3. Familiarity with Python or Java programming (for the PDFMEF project).
  4. Familiarity with JavaScript and web programming (for the Online Voting project).
The following are not required, but meeting all or part of them is a plus.
  1. Experience with relational databases such as MySQL.
  2. Experience with the Linux (Unix) environment.
  3. Research experience with publications in conferences/workshops/symposiums (any form).
  4. Experience with search engines.
  5. Experience with big data.
Availability of compensation is subject to the approval of an intramural grant application. If needed, the student will be provided with a desktop computer and remote access to servers. Research results will be disseminated as conference proceedings, journal articles, or posters. Substantial contributors will be listed as co-authors. All questions can be directed to Dr. Jian Wu (jwu@cs.odu.edu). Recruitment will close when the position is filled.

PhD on Applied Machine Learning and Natural Language Processing Systems [Closed]

The LAMP-SYS Lab in the Computer Science Department at Old Dominion University in Norfolk, VA, USA is recruiting a fully supported PhD student to conduct research on Applied Machine Learning and Natural Language Processing Systems. The student will work with Dr. Jian Wu, assistant professor of Computer Science, on mining scholarly big data and digital libraries. The project will leverage cutting-edge technologies in machine learning, deep learning, natural language processing, and big data for information extraction, classification, and retrieval from scholarly big data corpora, including but not limited to electronic theses and dissertations (ETDs), research papers, news articles, and Wikipedia articles. Specific tasks include but are not limited to typed-entity and relation extraction, citation graph generation and analysis, building search engine systems, developing multiclass and multilabel classification models, and applying word embeddings to text retrieval and summarization tasks. The lab directed by Dr. Jian Wu will closely collaborate with the DLRL group at Virginia Tech and the CiteSeerX group at the Pennsylvania State University in the form of data and software sharing and online meetings.

The requirements include the following:

  1. The applicant shall have a Bachelor's degree in at least one of the following subject domains: Computer Science, Information Science, Mathematics, Statistics, Physics, Astronomy & Astrophysics. A Master's degree in one of the above fields is preferred.
  2. The applicant should have solid programming and algorithmic skills, preferably in Python. Experience with packages such as scikit-learn, TensorFlow, and/or NLTK is a plus.
  3. The applicant should have relevant experience or publications in at least one of the following subjects: database systems, information retrieval, natural language processing, machine learning, and big data.

To be considered for this position, please email Dr. Jian Wu (jwu@cs.odu.edu) the following materials:

  1. Personal Statement: a 1-2 page document highlighting the motivation to join the PhD program and any of the projects mentioned above. The statement should also mention the applicant's contributions to the publications provided (if any).
  2. Curriculum Vitae
  3. Academic Transcript(s): preferably official, but unofficial ones are acceptable.
  4. Publications: Copies of previously published papers, if any.
  5. Three Reference Letters: optional, but required in the official application.

Please note that submitting the above documents does not constitute a full application for admission. The applicants may be asked to provide additional documents and materials required by the ODU graduate school (see below) if they are encouraged to apply. Recruitment will close when the position is filled.

About Me

Dr. Jian Wu is an assistant professor in the Computer Science Department at Old Dominion University (ODU), Norfolk, Virginia, United States. He is the tech leader of the CiteSeerX project, directed by Dr. C. Lee Giles. He is a member of the Web Science and Digital Libraries Research Group (WS-DL). He directs the Lab for Applied Machine Learning and Natural Language Processing Systems (LAMP-SYS) at ODU. Before joining ODU, Dr. Jian Wu was an assistant teaching professor in the College of Information Sciences and Technology (IST) at the Pennsylvania State University.

Dr. Jian Wu received his bachelor's degree in 2004 from the University of Science and Technology of China (USTC) in Physics and Astronomy. He obtained his Ph.D. degree from the Department of Astronomy and Astrophysics at the Pennsylvania State University in August 2011. After that, he joined the CiteSeerX team led by Dr. C. Lee Giles. Jian Wu is the tech leader of the CiteSeerX project. He led a small team to scale the CiteSeerX collection from 3 million to 10 million academic documents from 2015 to 2018. Dr. Jian Wu is the Co-PI of an NSF-supported project to build a scalable and sustainable CiteSeerX to support scholarly big data in the long term. Dr. Jian Wu has published 48 peer-reviewed papers in ACM, IEEE, and AAAI conferences, journals, and magazines, as of October 2020, including a best paper award and nominations. Dr. Jian Wu also published 7 journal articles in astronomical journals in his early career.

Dr. Jian Wu's collaborators include but are not limited to Dr. Michael Nelson, Dr. Michele Weigle, and Dr. Sampath Jayarathna at ODU CS, the CiteSeerX team directed by Dr. C. Lee Giles at the Pennsylvania State University, the Digital Library Research Laboratory directed by Dr. Ed Fox at Virginia Tech, Dr. Diane Oyen's group at the Los Alamos National Laboratory (LANL), the Document and Pattern Recognition Lab directed by Dr. Richard Zanibbi at Rochester Institute of Technology (RIT), the University of Illinois at Chicago (UIC), and the National University of Singapore (NUS). Dr. Wu's research is supported by the NSF, IMLS, DARPA, and DoE.

Dr. Wu's curriculum vitae can be downloaded here.

ORCID: 0000-0003-0173-4463.

Office: 3202 ECSB, Old Dominion University, Norfolk, VA, 23529.

Office phone: +1(757)683-7753.

Email: jwu at cs dot odu dot edu.