#include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include

using namespace std;
namespace fs=filesystem;

// Bit prefixes that identify each token class in the encoded stream.
const uint64_t CCC_DELIMITER_0_HEAD=0b0;
const uint64_t CCC_DELIMITER_1_HEAD=0b10;
const uint64_t CCC_C_KEYWORD_HEAD=0b1100;
const uint64_t CCC_MISCELANEOUS_HEAD=0b1101;
const uint64_t CCC_STRING_INLINE_HEAD=0b1110;
const uint64_t CCC_REC_TABLE_REF_HEAD=0b1111;
const uint64_t CCC_STRING_INLINE_END=0b00000000;

// Appends every element of `tail` to the end of `vec`.
#define CCC_ADD_COMPONENT(vec,tail) \
    do { \
        auto tmp=tail; \
        vec.insert(vec.end(),tmp.begin(),tmp.end()); \
    } while (0)

// Hash functor for std::string keys, backed by XXH3.
struct XXH3HasherString {
    size_t operator()(const std::string& s) const {
        return static_cast<size_t>(XXH3_64bits(s.data(),s.size()));
    }
};

// Accumulates bits, most significant bit first, into a byte buffer.
class bit_streamer {
private:
    vector<uint8_t> out;
    uint8_t current_byte=0;
    uint8_t bit_pos=0;
public:
    size_t index;
    bit_streamer(size_t index) {
        out.reserve(1024*1024);
        this->index=index;
    }
    size_t get_size() {
        return out.size();
    }
    // Writes the `count` low bits of `value`, highest bit first.
    void write_bits(uint64_t value,uint8_t count) {
        for (int i=count-1;i>=0;--i) {
            if ((value>>i) & 1) {
                current_byte|=(1<<(7-bit_pos));
            }
            bit_pos++;
            if (bit_pos==8) {
                out.push_back(current_byte);
                current_byte=0;
                bit_pos=0;
            }
        }
    }
    // Flushes a partially filled byte; the unused low bits stay zero.
    void align() {
        if (bit_pos>0) {
            out.push_back(current_byte);
            current_byte=0;
            bit_pos=0;
        }
    }
    const vector<uint8_t>& get_out() const {
        return out;
    }
    // Flushes pending bits and hands the buffer over to the caller.
    vector<uint8_t> extract_buffer() {
        align();
        return std::move(out);
    }
};

const vector<string> delimiter0={
    "{", "}", "(", ")", "[", "]", ",", "."
};
const vector<string> delimiter1={
    "{}", "()", "[]", ";"
};
const vector<string> miscellaneous={
    "!", "%", "'", "*", "+", "-", "/", ":", "<", ">", "=", "?", "^", "|", "&", "~",
    "+=", "-=", "*=", "/=", "%=", "&=", "|=", "^=", "<<=", ">>=",
    "++", "--", "<<", ">>", "==", "!=", "<=", ">=", "->", "...", "||", "&&",
    "NULL", "size_t",
    "uint8_t", "uint16_t", "uint32_t", "uint64_t",
    "int8_t", "int16_t", "int32_t", "int64_t"
};
const vector<string> c_keywords={
    "#if", "#ifdef", "#ifndef", "#else", "#elif", "#elifdef", "#elifndef", "#endif",
    "#define", "#undef", "#include", "#error", "#warning", "#pragma", "#line",
    "alignas", "alignof", "auto", "bool", "break", "case", "char", "const", "constexpr",
    "continue", "default", "do", "double", "else", "enum", "extern", "false", "float",
    "for", "goto", "if", "inline", "int", "long", "nullptr", "register", "restrict",
    "return", "short", "signed", "sizeof", "static", "static_assert", "struct", "switch",
    "thread_local", "true", "typedef", "typeof", "typeof_unequal", "union", "unsigned",
    "void", "volatile", "while",
    "__asm__", "__attribute__", "defined",
};

// On-disk archive header, packed so that no padding is written.
#pragma pack(push,1)
struct header {
    uint8_t sig[3];
    uint8_t flags;
    size_t size_rec_table;
    size_t entry_count;
    size_t size_payload;
};
#pragma pack(pop)

// A leaf of the syntax tree: type id plus byte range in the source.
struct node {
    uint32_t type;
    uint32_t start;
    uint32_t end;
};

struct file_entry {
    string name;
    string content;
    size_t size;
    size_t index;
};

struct thread_iterate_input_loop_call {
    string &source_code;
    vector<node> &thread_local_node_list;
    map& thread_local_rec_map;
};

struct thread_rec_map_result {
    map thread_local_rec_map;
};

struct thread_encoding_input_loop_call {
    string &source_code;
    vector<node> node_list;
    bit_streamer& thread_local_bit_stream;
};

struct thread_encoding_result {
    vector<bit_streamer> encoded_files;
};

queue<file_entry> rec_map_files_queue;
mutex rec_map_queue_mutex;
queue<file_entry> encoding_files_queue;
mutex encoding_queue_mutex;
mutex filename_nodes_mutex;
vector<string> rec_list;
unordered_map rec_lookup;
unordered_map> c_keyword_lookup;
unordered_map> miscelaneous_lookup;
unordered_map> delimiter0_lookup;
unordered_map> delimiter1_lookup;
unordered_map type_to_id;
vector<string> id_to_type;
unordered_map<string,vector<node>,XXH3HasherString> filename_to_node_list;
bool show_warning=false;
bool fail_on_warning=false;
bool enable_malloc_trim=true;
mutex type_alloc;

// Returns the numeric id of a node type, allocating one on first use.
// The lookup runs under the lock: an unlocked find() racing with an insert
// into the same unordered_map from another thread is a data race.
uint32_t get_id(const string& type) {
    lock_guard lock(type_alloc);
    auto it=type_to_id.find(type);
    if (it!=type_to_id.end()) return it->second;
    uint32_t id=id_to_type.size();
    type_to_id[type]=id;
    id_to_type.push_back(type);
    return id;
}

uint32_t ID_STRING_CONTENT,ID_SYSTEM_LIB_STRING,ID_IDENTIFIER,ID_NUMBER_LITERAL,ID_TYPE_IDENTIFIER,ID_FIELD_IDENTIFIER,ID_ESCAPE_SEQUENCE,ID_STATEMENT_IDENTIFIER,ID_PRIMITIVE_TYPE,ID_COMMENT,ID_PREPROC_ARG,ID_CHARACTER,ID_PREPROC_DIRECTIVE;
uint32_t ID_LEFT_PAR,ID_RIGHT_PAR,ID_LEFT_CROCHET,ID_RIGHT_CROCHET,ID_LEFT_ACC,ID_RIGHT_ACC,ID_QUOTE;

// Walks the syntax tree depth-first; records every leaf node and counts
// the occurrences of leaf texts that are candidates for the recurrence table.
void iterate_all_nodes_loop_call(thread_iterate_input_loop_call &settings,TSNode current_node) {
    if (ts_node_child_count(current_node)==0) {
        uint32_t start=ts_node_start_byte(current_node);
        uint32_t end=ts_node_end_byte(current_node);
        string_view text{settings.source_code.data()+start,end-start};
        string type=string(ts_node_type(current_node));
        if (type=="string_content" || type=="system_lib_string" || type=="identifier" || type=="number_literal" || type=="type_identifier" || type=="field_identifier" || type=="escape_sequence" || type=="statement_identifier") {
            settings.thread_local_rec_map[string(text)]++;
        }
        // Primitive types already covered by the keyword table are not counted.
        if (type=="primitive_type" && find(c_keywords.begin(),c_keywords.end(),text)==c_keywords.end()) {
            settings.thread_local_rec_map[string(text)]++;
        }
        // Comments are set to a count of 2 on first sight.
        if (type=="comment") {
            settings.thread_local_rec_map[string(text)]=2;
        }
        settings.thread_local_node_list.push_back({.type=get_id(type),.start=start,.end=end});
    } else {
        uint32_t child_count=ts_node_child_count(current_node);
        for (uint32_t i=0;i node_vector={};
        file_entry f;
        {
            lock_guard
lock(rec_map_queue_mutex); if (rec_map_files_queue.empty()) break; f=std::move(rec_map_files_queue.front()); rec_map_files_queue.pop(); } thread_iterate_input_loop_call loop_settings { .source_code=f.content, .thread_local_node_list=node_vector, .thread_local_rec_map=res.thread_local_rec_map }; TSTree *tree=ts_parser_parse_string(parser,nullptr,f.content.c_str(),f.content.size()); TSNode root=ts_tree_root_node(tree); loop_settings.source_code=f.content; iterate_all_nodes_loop_call(loop_settings,root); ts_tree_delete(tree); { lock_guard lock(filename_nodes_mutex); filename_to_node_list[f.name]=std::move(node_vector); } { lock_guard lock(encoding_queue_mutex); encoding_files_queue.push(std::move(f)); } if (++counter%20==0 && enable_malloc_trim) malloc_trim(0); } ts_parser_delete(parser); auto end=chrono::high_resolution_clock::now(); auto ms=chrono::duration_cast(end-start).count(); cout<<"Recccurences map thread number "<>=1; ++bits; } bitstream.align(); bitstream.write_bits(CCC_REC_TABLE_REF_HEAD,4); bitstream.write_bits(index,bits); bitstream.align(); return; } void generate_delimiter0(bit_streamer& bitstream,size_t index) { bitstream.align(); bitstream.write_bits(CCC_DELIMITER_0_HEAD,1); bitstream.write_bits(index,3); bitstream.align(); return; } void generate_delimiter1(bit_streamer& bitstream,size_t index) { bitstream.align(); bitstream.write_bits(CCC_DELIMITER_1_HEAD,2); bitstream.write_bits(index,2); bitstream.align(); return; } void generate_miscellaneous(bit_streamer& bitstream,size_t index) { bitstream.align(); bitstream.write_bits(CCC_MISCELANEOUS_HEAD,4); bitstream.write_bits(index,6); bitstream.align(); } void generate_string_content(bit_streamer& bitstream,const char *text,size_t text_len) { bitstream.align(); bitstream.write_bits(CCC_STRING_INLINE_HEAD,4); for (int i=0;isecond; generate_rec(out,index,rec_list.size()); } } else if (type==ID_PRIMITIVE_TYPE || type==ID_TYPE_IDENTIFIER) { auto it=c_keyword_lookup.find(string(text)); if 
            (it!=c_keyword_lookup.end()) {
                size_t index=it->second;
                generate_c_keyword(out,index);
            } else {
                auto it=rec_lookup.find(string(text));
                if (it==rec_lookup.end()) {
                    if (!text.empty()) {
                        generate_string_content(out,text.data(),text.size());
                    } else {
                        print_warning("Warning: type node is empty: "+string(text));
                        fail_if_warning();
                    }
                } else {
                    size_t index=it->second;
                    generate_rec(out,index,rec_list.size());
                }
            }
        } else if (delimiter0_lookup.find(id_to_type[type])!=delimiter0_lookup.end() || delimiter1_lookup.find(id_to_type[type])!=delimiter1_lookup.end() || type==ID_QUOTE) {
            string insert;
            if (type==ID_LEFT_PAR && i+1second;
                generate_delimiter0(out,index);
            } else {
                if (insert!="{}" && insert!="\"") {
                    auto it=delimiter1_lookup.find(insert);
                    if (it!=delimiter1_lookup.end()) {
                        size_t index=it->second;
                        generate_delimiter1(out,index);
                    } else {
                        print_warning("Warning: unknown delimiter, that shouldn't happen: "+insert);
                        fail_if_warning();
                    }
                } else {
                    // "{}" and "\"" share the "{}" index; one extra bit tells them apart.
                    if (insert=="{}") {
                        auto it=delimiter1_lookup.find("{}");
                        if (it!=delimiter1_lookup.end()) {
                            size_t index=it->second;
                            out.align();
                            out.write_bits(CCC_DELIMITER_1_HEAD,2);
                            out.write_bits(index,2);
                            out.write_bits(0b0,1);
                            out.align();
                        } else {
                            print_warning("Warning: unknown delimiter, that shouldn't happen: "+insert);
                            fail_if_warning();
                        }
                    } else if (insert=="\"") {
                        auto it=delimiter1_lookup.find("{}");
                        if (it!=delimiter1_lookup.end()) {
                            size_t index=it->second;
                            out.align();
                            out.write_bits(CCC_DELIMITER_1_HEAD,2);
                            out.write_bits(index,2);
                            out.write_bits(0b1,1);
                            out.align();
                        } else {
                            print_warning("Warning: unknown delimiter, that shouldn't happen: "+insert);
                            fail_if_warning();
                        }
                    } else {
                        print_warning("Warning: unknown delimiter, that shouldn't happen: "+insert);
                        fail_if_warning();
                    }
                }
            }
        } else if (c_keyword_lookup.find(id_to_type[type])!=c_keyword_lookup.end() || type==ID_PREPROC_DIRECTIVE) {
            if (type!=ID_PREPROC_DIRECTIVE) {
                auto it=c_keyword_lookup.find(id_to_type[type]);
                if (it!=c_keyword_lookup.end()) {
                    size_t index=it->second;
                    generate_c_keyword(out,index);
                } else {
                    print_warning("Warning: unknown C keyword, that shouldn't happen: "+id_to_type[type]+" "+string(text));
                    fail_if_warning();
                }
            } else {
                // Preprocessor directive: keyword table first, then the
                // recurrence table, then an inline string as a last resort.
                auto it=c_keyword_lookup.find(string(text));
                if (it!=c_keyword_lookup.end()) {
                    size_t index=it->second;
                    generate_c_keyword(out,index);
                } else {
                    auto it=rec_lookup.find(string(text));
                    if (it==rec_lookup.end()) {
                        if (!text.empty()) {
                            generate_string_content(out,text.data(),text.size());
                        } else {
                            print_warning("Warning: C keyword is empty: "+string(text));
                            fail_if_warning();
                        }
                    } else {
                        size_t index=it->second;
                        generate_rec(out,index,rec_list.size());
                    }
                }
            }
        } else if (miscelaneous_lookup.find(id_to_type[type])!=miscelaneous_lookup.end()) {
            auto it=miscelaneous_lookup.find(id_to_type[type]);
            if (it!=miscelaneous_lookup.end()) {
                size_t index=it->second;
                generate_miscellaneous(out,index);
            } else {
                print_warning("Warning: unknown miscellaneous, that shouldn't happen: "+id_to_type[type]);
                fail_if_warning();
            }
        } else if (type==ID_COMMENT) {
            auto it=rec_lookup.find(string(text));
            if (it==rec_lookup.end()) {
                if (!text.empty()) {
                    generate_string_content(out,text.data(),text.size());
                } else {
                    print_warning("Warning: comment is empty: "+string(text));
                    fail_if_warning();
                }
            } else {
                size_t index=it->second;
                generate_rec(out,index,rec_list.size());
            }
        } else {
            auto it=rec_lookup.find(id_to_type[type]);
            if (it==rec_lookup.end()) {
                if (!text.empty()) {
                    generate_string_content(out,text.data(),text.size());
                } else {
                    print_warning("Warning: unknown node is empty: "+string(text));
                    fail_if_warning();
                }
            } else {
                size_t index=it->second;
                generate_rec(out,index,rec_list.size());
            }
        }
    }
    out.align();
    return;
}

// Worker: pops files from the encoding queue and encodes each one into
// its own bit_streamer until the queue is empty.
thread_encoding_result run_thread_encoding(size_t thread_num) {
    auto start=chrono::high_resolution_clock::now();
    thread_encoding_result res;
    vector<bit_streamer> thread_local_encoded_files;
    int counter=0;
    int max=0;
    while (true) {
        file_entry f;
        {
            lock_guard lock(encoding_queue_mutex);
            if (encoding_files_queue.empty()) break;
f=std::move(encoding_files_queue.front()); encoding_files_queue.pop(); } thread_local_encoded_files.emplace_back(f.index); thread_encoding_input_loop_call encoding_loop_settings { .source_code=f.content, .node_list=std::move(filename_to_node_list[f.name]), .thread_local_bit_stream=thread_local_encoded_files[counter] }; encoding_loop_settings.source_code=f.content; process_file_nodes_loop_call(encoding_loop_settings); vector().swap(encoding_loop_settings.node_list); string().swap(f.content); if (++counter%20==0 && enable_malloc_trim) malloc_trim(0); } res.encoded_files=std::move(thread_local_encoded_files); auto end=chrono::high_resolution_clock::now(); auto ms=chrono::duration_cast(end-start).count(); cout<<"Parsing/encoding thread number "< files; for (int i=1;i(file)),istreambuf_iterator()); file_entry f{files[i],std::move(code),code.size()}; f.index=i; rec_map_files_queue.push(std::move(f)); } size_t nb_threads=thread::hardware_concurrency(); size_t total_files=files.size(); vector> rec_map_futures; for (size_t i=0;i all_rec_map_results; map global_rec_map; for (auto& fut:rec_map_futures) { all_rec_map_results.push_back(fut.get()); for (auto const& [str,count]:all_rec_map_results.back().thread_local_rec_map) { global_rec_map[str]+=count; } } for (auto const& [str,count]:global_rec_map) { if (count>=2 && str.size()>=3) { rec_list.push_back(str); rec_lookup[str]=rec_list.size()-1; } } global_rec_map.clear(); vector encoding_files_vec; while (!encoding_files_queue.empty()) { encoding_files_vec.push_back(std::move(encoding_files_queue.front())); encoding_files_queue.pop(); } sort(encoding_files_vec.begin(),encoding_files_vec.end(),[](const file_entry& a,const file_entry& b) { return a.size>b.size; }); for (auto& f:encoding_files_vec) { encoding_files_queue.push(std::move(f)); } vector> encoding_futures; for (size_t i=0;i all_encoding_results; for (auto& fut:encoding_futures) { all_encoding_results.push_back(fut.get()); } vector globals_bit_stream; for (auto& 
res:all_encoding_results) { globals_bit_stream.insert(globals_bit_stream.end(),res.encoded_files.begin(),res.encoded_files.end()); } sort(globals_bit_stream.begin(),globals_bit_stream.end(),[](const bit_streamer& a,const bit_streamer& b) { return a.index final_payloads; vector global_payloads_start; size_t total_size2=0; for(auto& bstr:globals_bit_stream) total_size2+=bstr.get_size(); final_payloads.reserve(total_size2); size_t current_offset=0; for (auto& bstr:globals_bit_stream) { global_payloads_start.push_back(current_offset); auto encoded_file=std::move(bstr.extract_buffer()); final_payloads.insert(final_payloads.end(),encoded_file.begin(),encoded_file.end()); current_offset+=encoded_file.size(); } // // Payload compression // vector payload_compressed; cout<<"Files Payloads (in bytes): "<(end-start).count(); if (ret!=LZMA_STREAM_END) { cout<<"Error: couldn't compress files archive."<=original_size) { flags&= ~(0b00000001); payload_total_size=original_size; vector().swap(payload_compressed); } else { flags|=0b00000001; payload_total_size=compressed_size; vector().swap(final_payloads); } lzma_end(&strm); // // Rec table compression // vector rec_table; for (size_t i=0;i rec_table_compressed; cout<<"Reccurences table (in bytes): "<(end-start).count(); if (ret!=LZMA_STREAM_END) { cout<<"Error: couldn't compress reccurences table."<=original_size) { flags&= ~(0b00000010); rec_table_total_size=original_size; vector().swap(rec_table_compressed); } else { flags|=0b00000010; rec_table_total_size=compressed_size; vector().swap(rec_table); } lzma_end(&strm2); // // Files table // vector files_table; for (int i=0;i files_table_compressed; files_table_compressed.resize(files_table.size()+files_table.size()/3+128); strm=LZMA_STREAM_INIT; if (lzma_easy_encoder(&strm,9,LZMA_CHECK_CRC64)!=LZMA_OK) { cout<<"Error: couldn't initialize LZMA compressor for files table."<=original_size) { flags&= ~(0b00000100); files_table_total_size=original_size; vector().swap(files_table); } 
else { flags|=0b00000100; files_table_total_size=compressed_size; vector().swap(files_table_compressed); } header head; head.sig[0]='C'; head.sig[1]='C'; head.sig[2]='C'; head.flags=flags; head.size_payload=payload_total_size; head.size_rec_table=rec_table_total_size; head.entry_count=files.size(); vector out; for (int i=0;i(out.data()),out.size()); fileout.close(); cout<<"Finished !"<